*Scalars*, *vectors*, and *matrices* are fundamental structures of linear algebra, and understanding them is integral to unlocking the concepts of deep learning.

- A scalar is a single quantity, such as a number.
- A vector is an array of numbers (scalar values).
- A matrix is a grid of numbers arranged in rows and columns.

They can all be represented in Python using the NumPy library.

```
import numpy

# Scalar code example
x = 5

# Vector code example
x = numpy.array([1, 2, 3, 4])

# Matrix code example (note the outer brackets around the rows)
x = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A *tensor* is a multidimensional array and is a generalized version of a vector and matrix. It is the fundamental data structure used in deep learning models.

```
import numpy as np

# A 0-D tensor would be a scalar
x = 5

# A 1-D tensor would be a vector
x = np.array([1, 2, 3])

# A 2-D tensor would be a matrix
x = np.array([[1, 2], [3, 4], [5, 6]])

# A 3-D tensor would be a "cube" or a "stack" of matrices
x = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
```

Neural networks are made up of multiple *layers* of computational units called *nodes*. The different types of layers a neural network can have are:

- *Input Layer*: The layer where inputs enter the neural network. There are as many nodes in the input layer as there are features in each input data point.
- *Hidden Layer*: A layer that comes between the input layer and the output layer. Hidden layers introduce complexity into a neural network and help with the learning process. A neural network can have any number of hidden layers (including zero).
- *Output Layer*: The final layer in the neural network. It produces the final result, and every neural network has exactly one output layer.

These nodes are connected to each other via *weights*, which are the learning parameters of a deep learning model, determining the strength of each linked node’s connection. The weighted sum between nodes and weights is calculated with the following formula:

`$weighted\_sum = inputs \cdot weight\_transpose$`
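As a minimal sketch with made-up values, the weighted sum of one node can be computed with NumPy (the input and weight values below are illustrative, not from the text):

```python
import numpy as np

inputs = np.array([0.5, -1.2, 2.0])   # activations from the previous layer
weights = np.array([0.4, 0.1, -0.6])  # one weight per connection

# Dot product of the inputs and the transposed weight vector
weighted_sum = np.dot(inputs, weights.T)
```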

*Activation functions* are applied to weighted sums and determine the value that is fired to a node in the next layer of the neural network. Two common activation functions are *ReLU* and *sigmoid*. An example of the sigmoid function is shown in the image.

The *bias node* shifts the activation function either left or right to create the best fit for the given data.

`$Activation(weighted\_sum + bias\_node)$`
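The two activation functions named above can be sketched as plain NumPy functions (the `weighted_sum` and `bias` values are hypothetical):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def relu(z):
    # Passes positive values through, zeroes out negatives
    return np.maximum(0, z)

weighted_sum = -1.12   # example weighted sum at a node
bias = 0.3             # example bias node value
activation = sigmoid(weighted_sum + bias)
```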

This process of firing data through each layer of a neural network is called *forward propagation*.
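Forward propagation through a tiny network can be sketched as follows. The layer sizes and random weights here are purely illustrative assumptions:

```python
import numpy as np

np.random.seed(0)  # reproducible illustrative weights

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A tiny network: 3 input features -> 4 hidden nodes -> 1 output node
x = np.array([0.5, -1.2, 2.0])                 # one input data point
W1, b1 = np.random.randn(4, 3), np.zeros(4)    # hidden-layer weights and biases
W2, b2 = np.random.randn(1, 4), np.zeros(1)    # output-layer weights and biases

hidden = sigmoid(W1 @ x + b1)     # weighted sums, then activation
output = sigmoid(W2 @ hidden + b2)
```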

After inputs have gone through a neural network one time, predictions are unlikely to be accurate. Thus, *gradient descent (GD)* is used to update the weight parameters between neurons to iteratively minimize the loss function and increase the accuracy of the learning model. The process of calculating these gradients is known as *backpropagation*.

When viewing the GD concept graphically, you will see that it looks for the minimum point of our loss function. Starting at a random point on the loss function, gradient descent will take “steps” in the “downhill direction” towards the negative gradient. The size of the “step” taken is dependent on the *learning rate*. Choosing an optimal learning rate affects both the accuracy and efficiency of the learning model’s results, as shown in the gif. The formula used with learning rate to update our weight parameters is the following:

```
$\begin{aligned}
parameter\_new=parameter\_old-learning\_rate \cdot \\ gradient(loss\_function(parameter\_old)) \\
\end{aligned}$
```
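A bare-bones sketch of this update rule, using an assumed one-parameter loss `L(w) = (w - 3)**2` whose gradient is `2 * (w - 3)`:

```python
def gradient(w):
    # Gradient of the example loss L(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0             # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    # parameter_new = parameter_old - learning_rate * gradient(...)
    w = w - learning_rate * gradient(w)
```

Each step moves `w` against the gradient, so it converges toward the loss minimum at `w = 3`.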

After data has been sent through a neural network, the error between the predicted values and actual values of the training data is calculated using a *loss function*. There are two commonly used loss calculation formulas:

- Mean squared error, which is used for regression problems. The gif above shows how mean squared error is calculated for a line of best fit in linear regression.
- Cross-entropy loss, which is used for classification learning models rather than regression.
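Both losses can be sketched in a few lines of NumPy; the example targets and predictions below are made up (cross-entropy is shown in its binary form):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences -- used for regression
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred):
    # Binary cross-entropy -- used for classification
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

mse = mean_squared_error(np.array([3.0, -0.5, 2.0]), np.array([2.5, 0.0, 2.0]))
ce = cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
```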

Deep learning models are often trained on extremely large datasets. For the sake of efficiency, variations of gradient descent that do not use the entire dataset for every parameter update are almost exclusively used. The most common ones are:

- *Stochastic Gradient Descent (SGD)*: A variant of gradient descent where, instead of using all data points to update weight parameters, a single random data point is selected.
- *Adam Optimization*: A variant of SGD that allows for adaptive learning rates.
- *Mini-batch Gradient Descent*: A variant of gradient descent that uses random batches of data to update parameters instead of a single random data point.
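Mini-batch gradient descent can be sketched on a toy linear-regression problem; the data, batch size, and learning rate below are illustrative assumptions:

```python
import numpy as np

np.random.seed(1)
# Toy data: y = 2x plus a little noise; the model learns a single weight w
X = np.random.randn(200)
y = 2 * X + 0.1 * np.random.randn(200)

w, learning_rate, batch_size = 0.0, 0.1, 32

for epoch in range(20):
    indices = np.random.permutation(len(X))      # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        # Gradient of mean squared error computed on this batch only
        grad = np.mean(2 * (w * X[batch] - y[batch]) * X[batch])
        w -= learning_rate * grad
```

Each update touches only one small batch, which is what makes this cheaper per step than full gradient descent.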

In *scalar multiplication*, we multiply every entry of the matrix by the scalar value.

In *matrix addition*, corresponding matrix entries are added together.

In *matrix multiplication*, the number of columns of the first matrix must equal the number of rows of the second matrix. The dot product between each row of the first matrix and each column of the second matrix is placed as an entry in the resulting matrix, as shown in the image.

A *matrix transpose* switches the rows and columns of a matrix.
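All four matrix operations above can be demonstrated with NumPy on two small example matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

scaled = 3 * A       # scalar multiplication: every entry multiplied by 3
added = A + B        # matrix addition: corresponding entries added
product = A @ B      # matrix multiplication: row-by-column dot products
transposed = A.T     # transpose: rows and columns switched
```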