Introduction to Deep Learning for NLP

Scalars, Vectors, and Matrices

Scalars, vectors, and matrices are fundamental structures of linear algebra, and understanding them is integral to unlocking the concepts of deep learning.

  • A scalar is a single quantity, such as a number.
  • A vector is an array of numbers (scalar values).
  • A matrix is a grid of information with rows and columns.

They can all be represented in Python using the NumPy library.

import numpy
# Scalar code example
x = 5
# Vector code example
x = numpy.array([1, 2, 3, 4])
# Matrix code example
x = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

Tensors

A tensor is a multidimensional array and is a generalized version of a vector and matrix. It is the fundamental data structure used in deep learning models.

import numpy as np
# A 0-D tensor would be a scalar
x = 5
# A 1-D tensor would be a vector
x = np.array([1, 2, 3])
# A 2-D tensor would be a matrix
x = np.array([[1, 2], [3, 4], [5, 6]])
# A 3-D tensor would be a "cube" or a "stack" of matrices
x = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
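
As a quick check on the terminology above, NumPy exposes the number of dimensions and the size along each dimension through the `ndim` and `shape` attributes. The sketch below reuses the 3-D example; the variable names are illustrative only.

```python
import numpy as np

# Illustrative 3-D tensor: a stack of two 2x3 matrices
tensor_3d = np.array([[[1, 2, 3], [4, 5, 6]],
                      [[7, 8, 9], [10, 11, 12]]])

print(tensor_3d.ndim)   # number of dimensions (axes): 3
print(tensor_3d.shape)  # size along each axis: (2, 2, 3)

# Indexing the first axis picks one matrix out of the stack
first_matrix = tensor_3d[0]
print(first_matrix.shape)  # (2, 3)
```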

Neural Network Concept Overview

Neural networks are made up of multiple layers of computational units called nodes. The different types of layers a neural network can have are:

  • Input Layer: The layer where inputs enter the neural network. There are as many nodes in the input layer as there are features in the input data point.
  • Hidden Layer: A layer that comes between the input layer and the output layer. Hidden layers introduce complexity into a neural network and help with the learning process. A neural network can have as many hidden layers as you want (including zero).
  • Output Layer: The final layer in our neural network. It produces the final result, so every neural network must have exactly one output layer.

These nodes are connected to each other via weights, which are the learning parameters of a deep learning model and determine the strength of the connection between linked nodes. The weighted sum of a node's inputs and their weights, plus a bias, is calculated with the following formula:

weighted\_sum = (inputs \cdot weight\_transpose) + bias\_node
This is a neural network with no hidden layers. It contains sections marked "Inputs," "Weighted Sum," "Activation Function," and "Output."
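
The weighted sum formula can be sketched in NumPy as follows. The input values, weights, and bias are made-up numbers for a single data point with three features feeding one node; they are assumptions for illustration, not values from the course.

```python
import numpy as np

# Hypothetical values: one data point with three features
inputs = np.array([[1.0, 2.0, 3.0]])      # shape (1, 3)
weights = np.array([[0.5, -1.0, 0.25]])   # shape (1, 3): weights for one node
bias_node = 0.5

# weighted_sum = (inputs . weight_transpose) + bias_node
weighted_sum = np.matmul(inputs, weights.T) + bias_node
print(weighted_sum)  # [[-0.25]]
```

The transpose is needed so that the shapes line up: a (1, 3) array multiplied by a (3, 1) array gives a (1, 1) result.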

Activation Functions and Forward Propagation

Activation functions are applied to the weighted sum computed at each node, and they determine what value is passed ("fired") to the neurons in the next layer of the neural network. Two common activation functions are ReLU and sigmoid. An example of the sigmoid function is shown in the image.

The bias node shifts the activation function either left or right to create the best fit for the given data.

Activation(weighted\_sum + bias\_node)

This process of firing data through each layer of a neural network is called forward propagation.

A sigmoid function graph: A sigmoid function has a logistic growth shape; the graph shown covers x values from -10 to 10. The formula for a sigmoid function is f(x) = 1/(1+e^(-x))
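
A minimal sketch of the two activation functions named above, applied in one forward-propagation step through a single layer. The layer size, weights, and bias values are hypothetical, and the bias is added once, as part of the weighted sum, before the activation is applied.

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); squashes any value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU keeps positive values and replaces negative values with 0
    return np.maximum(0.0, x)

# Hypothetical layer: 3 input features feeding 2 nodes
inputs = np.array([[1.0, 2.0, 3.0]])
weights = np.array([[0.1, 0.2, 0.3],
                    [-0.3, 0.2, -0.1]])
bias_node = np.array([0.0, 0.1])

# Forward propagation through the layer: weighted sum, then activation
weighted_sum = np.matmul(inputs, weights.T) + bias_node
layer_output = sigmoid(weighted_sum)
print(layer_output)
```

In a network with more layers, `layer_output` would become the `inputs` of the next layer.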

Gradient Descent and Backpropagation

After inputs have gone through a neural network one time, predictions are unlikely to be accurate. Thus, gradient descent (GD) is used to update the weight parameters between neurons to iteratively minimize the loss function and increase the accuracy of the learning model. The process of calculating these gradients is known as backpropagation.

When viewing the GD concept graphically, you will see that it looks for the minimum point of our loss function. Starting at a random point on the loss function, gradient descent takes “steps” in the “downhill direction”, that is, in the direction of the negative gradient. The size of each “step” depends on the learning rate. The choice of learning rate affects both the accuracy and efficiency of the learning model’s results, as shown in the gif. The formula used with the learning rate to update our weight parameters is the following:

parameter\_new = parameter\_old - learning\_rate \cdot gradient(loss\_function(parameter\_old))
This gif shows some possible issues one might run into when choosing a learning rate. The graph on the left depicts a learning rate that is too small, so the optimum value is found at an inefficient rate. The graph in the middle depicts a point being scattered all around the graph, with the optimal loss value never being found because the learning rate is too high. The graph on the far right shows an optimal learning rate, where the optimum value is found efficiently and accurately.
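
To make the update rule concrete, here is a minimal sketch of gradient descent on a one-parameter loss. The quadratic loss, its hand-written gradient, the starting point, and the learning rate are all assumptions chosen for illustration; in a real network the gradients would come from backpropagation. Because each step moves against the gradient (the “downhill direction”), the scaled gradient is subtracted.

```python
def loss_function(parameter):
    # Hypothetical loss with its minimum at parameter = 3
    return (parameter - 3.0) ** 2

def gradient(parameter):
    # Derivative of the loss above with respect to the parameter
    return 2.0 * (parameter - 3.0)

learning_rate = 0.1
parameter = 0.0  # arbitrary starting point

for step in range(100):
    # Step downhill: subtract the gradient scaled by the learning rate
    parameter = parameter - learning_rate * gradient(parameter)

print(parameter, loss_function(parameter))
```

With this loss, a much smaller `learning_rate` still converges but needs more steps, while a value above 1.0 makes each step overshoot further than the last.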

Loss Functions

After data has been sent through a neural network, the error between the predicted values and actual values of the training data is calculated using a loss function. There are two commonly used loss calculation formulas:

  • Mean squared error, which is used for regression problems. The accompanying visual shows how mean squared error is calculated for a line of best fit in linear regression.
  • Cross-entropy loss, which is used for classification learning models rather than regression.
A visual that shows the squared distance of each point from a line of best fit. The formula for this is (predicted value - actual value)^2
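
Both loss calculations can be sketched in a few lines of NumPy. The function names are hypothetical, and the cross-entropy shown is the binary (two-class) form, which is an assumption; multi-class problems use a generalization of it.

```python
import numpy as np

def mean_squared_error(predicted, actual):
    # Average of the squared differences between predictions and targets
    return np.mean((predicted - actual) ** 2)

def binary_cross_entropy(predicted_prob, actual_label, eps=1e-12):
    # Clip probabilities so that log(0) is never evaluated
    p = np.clip(predicted_prob, eps, 1.0 - eps)
    return -np.mean(actual_label * np.log(p)
                    + (1.0 - actual_label) * np.log(1.0 - p))

# Regression example: errors of 1 and 2 give (1 + 4) / 2 = 2.5
mse = mean_squared_error(np.array([2.0, 4.0]), np.array([1.0, 2.0]))
print(mse)  # 2.5

# Classification example: confident, correct predictions give a small loss
bce = binary_cross_entropy(np.array([0.9, 0.1]), np.array([1.0, 0.0]))
print(bce)
```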

Variations of Gradient Descent

Deep learning models are often trained on extremely large datasets. For the sake of efficiency, variations of gradient descent that do not use every data point for each parameter update are almost exclusively used. The most common ones are:

  • Stochastic Gradient Descent (SGD): A variant of gradient descent where instead of using all data points to update weight parameters, a random data point is selected.
  • Adam Optimization: A variant of SGD that allows for adaptive learning rates.
  • Mini-batch Gradient Descent: A variant of gradient descent that uses random batches of data to update parameters instead of a random data point.
A diagram that shows the difference in performance between Adam, stochastic gradient descent, and gradient descent for a particular learning model. Adam performs the best over time, followed by stochastic gradient descent.
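
The difference between these variants is mostly how many data points feed each update. Below is a minimal sketch of mini-batch gradient descent fitting a line to made-up, noiseless data; the dataset, learning rate, batch size, and number of epochs are assumptions for illustration. Adam is not shown here, since it additionally tracks running statistics of the gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noiseless data following y = 2x + 1
x = np.linspace(-1.0, 1.0, 100)
y = 2.0 * x + 1.0

weight, bias = 0.0, 0.0
learning_rate = 0.1
batch_size = 10  # 1 would be SGD; len(x) would be full-batch gradient descent

for epoch in range(200):
    order = rng.permutation(len(x))  # shuffle the data each epoch
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]
        error = (weight * x[batch] + bias) - y[batch]
        # Gradients of mean squared error computed on this batch only
        grad_weight = 2.0 * np.mean(error * x[batch])
        grad_bias = 2.0 * np.mean(error)
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias

print(weight, bias)  # should approach 2 and 1
```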

Scalar Multiplication with Matrices

In scalar multiplication, we multiply every entry of the matrix by the scalar value.

Scalar multiplication being performed in a gif that shows a scalar value being multiplied by each element in a 2x2 matrix. For example, the number in the top left of our 2x2 matrix is 4. When it is multiplied by our scalar value, 2, we end up with the number 8 in our resulting 2x2 matrix.
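
In NumPy, multiplying an array by a scalar applies the multiplication to every entry. In this sketch only the top-left value 4 and the scalar 2 come from the example above; the other matrix entries are made up.

```python
import numpy as np

scalar = 2
# Top-left entry matches the example; the other entries are hypothetical
matrix = np.array([[4, 1],
                   [0, 3]])

result = scalar * matrix
print(result)
# [[8 2]
#  [0 6]]
```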

Matrix Addition

In matrix addition, corresponding matrix entries are added together.

Matrix addition being performed between two 2x2 matrices. Each element in the same position is added together. For example, if we add the numbers in the top left corner of each matrix being added — 3 in the first matrix, and 4 in the second matrix — we get a result of 7 in the top left corner of our new 2x2 matrix.
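
The `+` operator on two NumPy arrays of the same shape adds them entry by entry. Only the top-left values (3 and 4) come from the example above; the remaining entries are made up.

```python
import numpy as np

# Top-left entries match the example; the other entries are hypothetical
matrix_a = np.array([[3, 1],
                     [5, 2]])
matrix_b = np.array([[4, 0],
                     [1, 6]])

matrix_sum = matrix_a + matrix_b
print(matrix_sum)
# [[7 1]
#  [6 8]]
```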

Matrix Multiplication

In matrix multiplication, the number of columns of the first matrix must be equal to the number of rows of the second matrix. The dot product of each row of the first matrix with each column of the second matrix is computed and placed as an entry in the resulting matrix, as shown in the image.

This animation shows the process of matrix multiplication between a 2x3 matrix and a 3x2 matrix. In the first step, the dot product between the first row of the left matrix and the first column of the right matrix is performed. The first row contains the numbers (1 2 3) and the first column contains the numbers (7 9 11). The dot product between (1 2 3) and (7 9 11) is equal to 1*7 + 2*9 + 3*11 = 58. 58 is placed as the top left value of the resulting 2x2 matrix.
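
The same calculation can be reproduced with `np.matmul`. The first row (1 2 3) and first column (7 9 11) come from the animation description; the remaining entries are assumed for illustration.

```python
import numpy as np

# First row of `left` and first column of `right` match the example;
# the other entries are hypothetical
left = np.array([[1, 2, 3],
                 [4, 5, 6]])        # shape (2, 3)
right = np.array([[7, 8],
                  [9, 10],
                  [11, 12]])        # shape (3, 2)

product = np.matmul(left, right)    # shape (2, 2)
print(product)
# [[ 58  64]
#  [139 154]]

# The top-left entry is the dot product of row 0 and column 0
top_left = np.dot(left[0], right[:, 0])
print(top_left)  # 58
```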

Matrix Transpose

A matrix transpose switches the rows and columns of a matrix.

A gif that shows each row of a matrix being converted into columns. For example, the first row in our original 2x3 matrix is (6, 4, 24). When we perform the transpose, the first column of our resulting 3x2 matrix is (6, 4, 24).
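
NumPy arrays expose the transpose through the `.T` attribute. In this sketch the first row (6, 4, 24) comes from the example above, and the second row is made up.

```python
import numpy as np

# First row matches the example; the second row is hypothetical
matrix = np.array([[6, 4, 24],
                   [1, -9, 8]])     # shape (2, 3)

transposed = matrix.T               # shape (3, 2)
print(transposed)
# [[ 6  1]
#  [ 4 -9]
#  [24  8]]
```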
