
Transformer Architecture Explained With Self-Attention Mechanism

Transformers are the deep learning models that enable large language models (LLMs) to understand the contextual meaning of text inputs and generate relevant text outputs. In this article, we’ll discuss how the transformer architecture works, focusing on the self-attention mechanism that makes these models powerful at understanding context and generating relevant responses.


What is a transformer in generative AI?

In generative AI, a transformer is a neural network architecture designed to process and generate text or sequential data. Most LLM applications, including ChatGPT, Gemini, and Claude, are implemented using the transformer architecture.

We use self-supervised learning and process web-scale datasets to build transformer models. During training, the transformer model captures the statistical understanding of the text in the dataset. For example, the model learns that the probability of the term “book” coming after “reading” is 0.48 and the probability of the term “blog” coming after “reading” is 0.40. On the contrary, the probability of the term “banana” coming after “reading” might be 0.00000001. Along with the statistical understanding, the model also learns the contextual meaning of the text data using the self-attention mechanism. This helps the transformer model understand the semantic meaning of a given query and generate relevant outputs.

To understand how transformer models work, let’s discuss their components, starting with the transformer architecture diagram.

Transformer architecture overview

The first transformer architecture was introduced in the paper “Attention Is All You Need”. It was designed to translate text from one language to another and looks as follows:

Image showing the transformer architecture diagram

A transformer architecture has different components, such as input embedding, positional encoding, a multi-head attention layer, an Add & Norm layer, a feed-forward neural network, a masked multi-head attention layer, a linear neural network, and a softmax layer. Let’s discuss each of these components individually to understand the transformer architecture better.

Understanding different components in the transformer architecture

A transformer model can be broadly divided into two parts: encoder and decoder. The encoder and decoder comprise different components such as multi-head attention, feed-forward neural network, add-and-norm layer, residual connections, etc. Outside the encoder and decoder, we also have input embedding, positional encoding, a softmax layer, and a linear neural network layer.

Input embedding

Any deep learning model only understands numbers. So, the model converts the text in the training data or the user input into embedding vectors. To do this, all the words in the input training data are converted into tokens, which are subword units. For example, reading can be converted into read and ing. All these tokens are then converted to vector embeddings using a neural network model. The neural network model captures the semantic and syntactic relationships between the tokens in the training data and provides a vector embedding for every token.

Once we get the vector embeddings, we create a vocabulary that stores the token IDs, tokens, and vector embeddings. You can generate vector embeddings for words in a text dataset using models like Google Word2Vec, FastText, or GloVe.
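As a sketch of the idea, here is a toy lookup from tokens to IDs and embedding vectors. The vocabulary and the 4-dimensional embedding values below are made up for illustration; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
# Hypothetical toy vocabulary mapping tokens (subword units) to token IDs.
vocabulary = {
    "read": 0,
    "ing": 1,
    "book": 2,
}

# One embedding vector per token ID. These constants are invented for
# illustration; a real model learns these values during training.
embedding_table = [
    [0.12, -0.45, 0.33, 0.08],   # "read"
    [-0.21, 0.64, -0.02, 0.55],  # "ing"
    [0.71, 0.03, -0.38, 0.19],   # "book"
]

def embed(tokens):
    """Look up the embedding vector for each token."""
    return [embedding_table[vocabulary[t]] for t in tokens]

# "reading" tokenized into subword units, then embedded.
vectors = embed(["read", "ing"])
print(len(vectors), len(vectors[0]))  # 2 tokens, 4 dimensions each
```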

Positional encoding

Unlike recurrent neural networks or convolutional neural networks, which process input data sequentially, a transformer model processes all the tokens in the input text in parallel. Hence, positional encoding adds identifiers to the vector embeddings of the input tokens to denote their position in the input text.

For instance, consider these sentences where word order changes meaning:

  • The cat chased the dog.
  • The dog chased the cat.

The words are identical, but their positions create completely different meanings. Positional encoding ensures transformers understand these crucial position-based differences.
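The original transformer uses fixed sinusoidal positional encodings, which we can sketch in a few lines of plain Python (the function name and toy dimensions here are our own):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=5, d_model=8)
# Position 0 encodes as [0, 1, 0, 1, ...]; later positions differ, so
# identical tokens at different positions get distinct final vectors.
print(pe[0][:4])  # [0.0, 1.0, 0.0, 1.0]
```

These position vectors are added element-wise to the token embeddings before the encoder processes them.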

Static embeddings with positional encoding aren’t always sufficient for the model to understand the user input. For instance, consider the following sentences:

  • I have a golden watch.
  • Let’s watch a movie tonight.

In these sentences, the word “watch” has completely different meanings. However, the static embedding for “watch” will be the same in both sentences. To capture the meaning of a token in a sentence, we generate its contextual embedding using the encoder. The encoder consists of multi-head attention, a feed-forward neural network, an add & norm layer, and residual connections.

Multi-head attention layer

A multi-head attention or multi-head self-attention layer identifies how each token affects (attends to) other tokens in the input. It uses multiple attention heads that allow the transformer to focus on syntactic, semantic, and positional relationships between the tokens. Each attention head generates context-aware embeddings of the input tokens using three matrices Wk, Wv, and Wq, which the transformer model learns during the training process.

  • Wk knows how to encode the key vector (K) of a token for computing attention.
  • Wv knows how to encode the value vector (V) of a token for computing attention.
  • Wq knows how to encode the query vector (Q) of a token for attention computation.

The context-aware embeddings generated by the different attention heads are computed using the self-attention mechanism and are then concatenated and passed through a linear projection to get the contextual embedding of a token. In the following sections, we will discuss the key, value, and query vectors along with the self-attention mechanism in detail.

Feed forward layer

A feed-forward layer is a two-layer fully-connected neural network layer that enriches the embedding vectors by applying non-linear transformations.
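A minimal sketch of this position-wise network in plain Python, with tiny made-up weights (real models learn the weights and use much larger dimensions, e.g., 512 → 2048 → 512):

```python
def linear(x, w, b):
    """y_j = sum_i x[i] * w[i][j] + b[j] for a single token vector x."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def feed_forward(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear, applied independently to each token."""
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]  # ReLU non-linearity
    return linear(hidden, w2, b2)

# Toy weights: expand from 2 dimensions to 3, then project back to 2.
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b2 = [0.0, 0.0]

out = feed_forward([0.5, -0.2], w1, b1, w2, b2)
print(out)  # [0.5, 0.0] -- the negative component was zeroed by ReLU
```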

Add & norm layers

We use add & norm layers after the multi-head attention and feed-forward layers. The add & norm layers add any given layer’s output and input using a residual connection. It helps with smooth gradient flow and preserves the original information while allowing the transformer model to learn the patterns in the data. After adding the input and output, the add & norm layer performs layer normalization that helps stabilize the training process by keeping the values in the embedding vectors in a consistent scale.

Masked multi-head attention

The masked multi-head attention is only present in decoders. It is similar to the encoder’s multi-head attention layer, but with a mask. The mask ensures the decoder cannot look at future tokens while predicting any token.
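The mask can be sketched as a lower-triangular matrix: allowed positions get 0 and future positions get negative infinity, so that adding the mask to the attention scores before the softmax drives the weights on future tokens to zero.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions j <= i.
    Disallowed (future) positions get -inf so softmax assigns them weight 0."""
    return [[0.0 if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

mask = causal_mask(3)
# Row 0 can only see token 0; row 2 can see tokens 0, 1, and 2.
print(mask[0])  # [0.0, -inf, -inf]
```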

Multi-head cross attention

The multi-head attention layer in the decoder takes the key and value vectors of the inputs from the encoder. Hence, it is also called a multi-head cross-attention layer.

Linear neural network layer

The linear neural network layer maps the decoder’s outputs from the embedding dimension to the vocabulary dimension. This gives us a vector of logits for every possible word in the vocabulary for each token position.

Softmax layer

The softmax layer converts the logits from the linear neural network layer into probabilities, and we get a probability distribution over the vocabulary for each position in the output. Finally, the transformer model generates the output text using the probabilities and different decoding strategies, such as greedy decoding and beam search.
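These two steps can be sketched in plain Python; the vocabulary and logit values below are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution
    (numerically stable version: subtract the max first)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-word vocabulary at one output position.
vocab = ["book", "blog", "banana", "movie"]
logits = [2.1, 1.9, -8.0, 0.3]
probs = softmax(logits)

# Greedy decoding simply picks the highest-probability token.
next_token = vocab[probs.index(max(probs))]
print(next_token)  # book
```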

All these components stitched together create the encoder and decoder. Let’s discuss how the encoder and decoder work in a transformer.

How do encoders work in a transformer model?

The encoder in a transformer model generates the contextual embedding of an input text sequence. It uses the following steps to generate the contextual embedding:

  • The self-attention layer calculates the attention score of each token towards every other token using the self-attention mechanism.
  • The feed-forward layer enriches the self-attention layer’s output by performing non-linear operations.
  • The add & norm layers ensure smooth gradient flow and normalized values while passing the embedding vectors from one layer to another.

During training, the loss is backpropagated, and the weights of the embedding, attention, and feed-forward layers are updated accordingly. After training, the attention layers in the encoder finalize the three matrices Wk, Wv, and Wq, which they use during inference to generate contextual embedding of input queries.

How do decoders work in a transformer model?

The decoder predicts the next word in a given sequence using contextual embeddings from the encoder. To understand how it works, suppose we want to train a transformer model to generate the sentence I love Codecademy. To do this, we shift the target sequence one position to the right by prepending a start token:

Target: [I, love, Codecademy, <end>]
Decoder input (shifted right): [<start>, I, love, Codecademy]

The masked multi-head self-attention layer ensures that the decoder can only see the past and the current tokens. This ensures causal and autoregressive training. For example, inputs and outputs for the decoder while training to predict the words in the sentence I love Codecademy will look as follows:

Pass 1:
Input to decoder: [<start>]
Prediction: [<start>, I]
Pass 2:
Input to decoder: [<start>, I]
Prediction: [<start>, I, love]
Pass 3:
Input to decoder: [<start>, I, love]
Prediction:[<start>, I, love, Codecademy]
Pass 4:
Input to decoder: [<start>, I, love, Codecademy]
Prediction: [<start>, I, love, Codecademy, <end>]

These passes seem sequential, but training actually runs in parallel: because the whole target sequence is available, all the token positions are predicted simultaneously.

In contrast to training, the decoder works sequentially during inference. It begins with the <start> token. At any instance, it uses the sequence of already generated tokens as its input and predicts the next token using contextual embeddings from the encoder. The decoder then appends the predicted token to the previous input and predicts the next token until it generates the <end> token.
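The inference loop above can be sketched as follows. Here, `next_token` is a hypothetical stand-in for the full decoder forward pass, hard-coded so the control flow of autoregressive generation is visible:

```python
# Hypothetical sketch of autoregressive decoding. A real decoder would run
# masked self-attention, cross-attention, the linear layer, and softmax at
# each step; here a hard-coded lookup stands in for that forward pass.
transitions = {
    "<start>": "I",
    "I": "love",
    "love": "Codecademy",
    "Codecademy": "<end>",
}

def next_token(sequence):
    """Stand-in for the decoder: predict the next token from the sequence."""
    return transitions[sequence[-1]]

sequence = ["<start>"]
while sequence[-1] != "<end>":
    # Append the prediction and feed the extended sequence back in.
    sequence.append(next_token(sequence))

print(sequence)  # ['<start>', 'I', 'love', 'Codecademy', '<end>']
```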

As we have mentioned multiple times, transformer models generate contextual embeddings and predict the next word in a sequence using the self-attention mechanism. Let’s discuss how the self-attention layers in a transformer model work to generate the contextual embeddings.

The self-attention mechanism

Self-attention is a mechanism where each token in the input pays attention to all other tokens, including itself, to generate its contextual embedding. Calculating attention is a way for each token to ask, “Which other words should I focus on to understand my meaning?”

To calculate attention, every token needs three vectors: the query vector (Q), the key vector (K), and the value vector (V).

The QKV vectors

The query (Q), key (K), and value (V) vectors are also sometimes referred to as QKV vectors.

  • A token’s query vector (Q) represents what features the token is looking for in other tokens. Each token generates a query vector that is matched against the key vectors of the other tokens to decide which tokens are relevant.
  • A token’s key vector (K) helps other tokens decide how relevant the current token is to them. It encodes features of a token that are useful for matching against queries of other tokens. The dot product of the query vector of a token with the key vector of another token gives us a similarity score showing the relevance of the tokens to each other.
  • A token’s value vector (V) contains the actual information that a token contributes in the text.

We calculate the Q, K, and V vectors for a token using the embedding vector X and the matrices Wk, Wv, and Wq as follows:

Q = X · Wq
K = X · Wk
V = X · Wv

After calculating the QKV vectors for a token, the self-attention layer calculates how much attention a particular token pays to other tokens.

Attention calculation

To calculate how much a token i attends to another token j, the transformer model computes the dot product of the query vector of the token i with the key vector of the token j:

attentionScore(i,j) = Qi · Kj

To scale the attention score, we divide it by √dk, where dk is the dimension of the key vectors. Then we apply the softmax function to normalize the scores into weights between 0 and 1 that sum to 1.

weight(i,j) = softmax((Qi · Kj) / √dk)

Here, weight(i,j) is the normalized weight for how much attention token i pays to token j. After this, we multiply weight(i,j) by the value vector Vj to get the weighted contribution of token j’s value to the final contextual embedding of token i:

weightedValue(i,j) = weight(i,j) · Vj

Finally, we sum these weighted contributions over all tokens j to get the context-aware embedding for token i:

contextualEmbedding(i) = Σj weight(i,j) · Vj

This way, the attention layer calculates the contextual embedding of each token i in the input. Practically, the attention scores are computed using matrices.

Attention calculation in the matrix format

In the matrix format, the model creates the Q, K, and V matrices as follows:

  • Each row in the matrix Q represents Qi, i.e., query vector for token i.
  • Each row in the matrix K represents Ki, i.e., the key vector for the token i.
  • Each row in the matrix V represents Vi, i.e., value vector for the token i.

The self-attention layer calculates the attention scores of tokens towards each other using the Q and K matrices as follows:

attentionScores = Q · K^T

To scale the values, we divide the attention score matrix by √dk and pass the output to the softmax function.

attentionScores = softmax((Q · K^T) / √dk)

Finally, we calculate the dot product of the attention score matrix with the value matrix V to obtain the contextual embeddings of the input tokens.

contextualEmbeddings = Attention(Q, K, V) = softmax((Q · K^T) / √dk) · V
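The matrix computation above maps directly to a few lines of NumPy. This is a minimal single-head sketch with random toy inputs; the function name and shapes are ours for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    # Row-wise softmax (numerically stable: subtract the row max).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 3 tokens with 4-dimensional Q/K/V vectors.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

contextual, weights = scaled_dot_product_attention(Q, K, V)
print(contextual.shape)      # (3, 4): one contextual embedding per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```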

A transformer model has multiple attention heads, each of which calculates the contextual embeddings of the tokens independently. The outputs from all the heads are then concatenated and passed through a linear projection to obtain the final contextual embedding of each token.

The pre-training process helps the transformer model learn how to generate contextual embeddings and predict text. However, it isn’t always sufficient for day-to-day tasks like summarization, paraphrasing, or question-answering. Hence, we further train transformer models for specific tasks using supervised fine-tuning (transfer learning) and reinforcement learning.

Different transformer models and their variations

A transformer can have an encoder, a decoder, or both. Based on this, we can categorize the transformers into encoder-only, decoder-only, and encoder-decoder transformers.

Encoder-only transformers

Encoder-only transformers generate contextual embeddings for text data. Examples include BERT, DistilBERT, RoBERTa, and ELECTRA. We use encoder-only transformer models for text classification, named entity recognition, semantic search, information retrieval, and clustering applications.

Decoder-only transformers

Decoder-only transformers use only the decoder stack of the original transformer architecture. Examples of decoder-only transformer models include GPT models, LLaMA, Mistral, Falcon, and StarCoder. Decoder-only transformers are excellent at text generation tasks, and we use them in creative writing, code generation, and chatbot applications.

Encoder-decoder transformers

Encoder-decoder transformers have both an encoder and a decoder. Examples of encoder-decoder transformers include the original transformer model, Bidirectional and Auto-Regressive Transformers (BART), Text-to-Text Transfer Transformer (T5), MarianMT, ProphetNet, and Pegasus. These models are best for translation, summarization, paraphrasing, and other text-generation tasks.

Conclusion

Transformers have completely changed how deep learning models understand and process text data. Starting with the first transformer model, which was designed to translate text from one language to another, transformer models have developed to be able to talk to people, generate images and videos, and perform day-to-day tasks. In this article, we discussed the transformer architecture, including the different components of a transformer and the self-attention mechanism. We also discussed the different types of transformer models and their examples.

To learn more about transformer models, you can go through the Intro to AI Transformers course that discusses how to work with transformers using Hugging Face. You might also like Learn How To Build Your Own GPT course that uses PyTorch to build a GPT model from scratch.

Frequently asked questions

1. What is Nx in a transformer?

The term Nx denotes that the transformer has N blocks of encoders and N blocks of decoders. The original transformer architecture has six encoder and six decoder blocks.

2. What is layer normalization in transformers?

Layer normalization in transformers is a technique that stabilizes gradients and speeds up training. It prevents the problem of exploding or vanishing gradients and helps the model converge quickly.

3. What is self-attention in transformers?

Self-attention is a mechanism a transformer model uses to understand the meaning of a token based on the other tokens in the input. It helps the transformer model generate contextual embeddings of all the tokens in the input.

4. What is QKV in Transformers?

QKV is an acronym for Query (Q), Key (K), and Value (V) vectors in a transformer model. The transformer generates the QKV vectors for an input token using the static embedding of a token and the Wq, Wk, and Wv matrices.

5. Is ChatGPT a transformer model?

Yes. The GPT models behind ChatGPT, such as GPT-3, GPT-4, and GPT-5, are decoder-only transformer models built for autoregressive text generation tasks.

6. Is BERT a transformer model?

Yes. Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only transformer model built to generate contextual embeddings of text inputs.

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.


Learn more on Codecademy

  • Learn about what transformers are (the T of GPT) and how to work with them using Hugging Face libraries
    • Intermediate.
      3 hours
  • Learn how to build a Generative Pre-trained Transformer (GPT) from scratch using PyTorch.
    • With Certificate
    • Intermediate.
      2 hours
  • Learn to build production-ready neural networks with PyTorch, including finetuning transformers, in this hands-on path.
    • Includes 6 Courses
    • With Certificate
    • Intermediate.
      8 hours