
Transformers: the "T" in GPT

What’s a transformer?

GPT stands for “Generative Pre-trained Transformer”. Transformers are a type of neural network architecture, introduced in 2017, that GPT models are built on; they are trained in two stages known as pretraining and finetuning.

Pretraining and Finetuning

  • Pretraining is the act of training a model from scratch on a large dataset. The model’s weights are randomly initialized, so training starts without any prior knowledge, and the weights are updated throughout the process.
  • Finetuning is further training of a pretrained model, usually on a smaller, task-specific dataset: the knowledge the pretrained model has acquired is “transferred” to the new task, hence the term transfer learning (a minimal sketch follows this list).
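
The sketch below illustrates the idea, assuming PyTorch and the Hugging Face transformers library (the cheatsheet doesn’t prescribe a toolkit): a pretrained BERT checkpoint is loaded and finetuned for a few steps on a tiny, made-up sentiment dataset.

```python
# Minimal transfer-learning sketch, assuming PyTorch and the Hugging Face
# `transformers` library; the dataset and hyperparameters are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"                      # pretrained weights, not random ones
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny made-up labeled dataset for a downstream task (sentiment classification).
texts = ["I loved this movie!", "This was a waste of time."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                    # a few finetuning steps
    outputs = model(**batch, labels=labels)           # the model returns its own loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```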

Transformer Architecture

  • The original transformer architecture, as proposed in 2017, contains two blocks: an encoder and a decoder. Since then, transformer models have also been built that are encoder-only or decoder-only.
  • Encoders and decoders are both neural networks with special layers known as attention layers. These layers use a “self-attention mechanism” to capture contextual information in sequences; the primary difference between encoders and decoders is how they apply that mechanism (a sketch of the computation follows the figure below).
Image showing the primary components of a transformer - the encoder and decoder.
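
As a rough illustration of what an attention layer computes, here is a minimal NumPy sketch of scaled dot-product self-attention; the shapes and random weights are toy values, and real layers add multiple heads, masking, and learned projections.

```python
# Minimal self-attention sketch in NumPy; shapes and weights are toy values.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # how strongly each token attends to every other
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                              # context-aware token representations

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 tokens, embedding size 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)       # -> (4, 8)
```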

Transformers are data agnostic

Transformer models are data agnostic, i.e., they can work with different types of training data such as text, images, video, audio, and protein sequences.

Self-attention mechanism

  • In encoders, the masking mechanism is random: tokens at random positions are hidden during training. Attention layers therefore have bidirectionality, i.e., access to the tokens on both sides of a masked token.
  • In decoders, the masking mechanism during training hides everything at and after the token to be predicted, so each position can only attend to earlier tokens. This makes decoders great at tasks that involve sequence completion, like text generation (both masking styles are illustrated below).
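
The contrast between the two masking styles can be sketched with a toy 4-token sequence; the matrices below are illustrative and not taken from any particular model.

```python
# Toy illustration of decoder-style (causal) vs. encoder-style (random) masking.
import numpy as np

seq_len = 4

# Decoder: position i may only attend to positions 0..i, so everything at and
# after the token to be predicted is hidden during training.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Encoder: a random subset of input tokens is hidden (here token 1), but the
# attention itself stays bidirectional, so every position sees both sides.
masked_positions = np.array([0, 1, 0, 0])           # e.g. token 1 replaced with [MASK]
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)
print(masked_positions)
print(bidirectional_mask)
```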

Transformer model sizes

The size of transformer models has grown exponentially since their inception in 2017, as their performance scales with the size of the training data. Early models had on the order of a billion parameters or fewer; current models are roughly 100 times bigger, at hundreds of billions of parameters.

Image showing how the number of parameters in transformers have changed from less than a billion in 2017 to hundreds of billions in 2023.
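
One way to get a feel for these sizes is to count parameters directly; the sketch below assumes the Hugging Face transformers library and uses a few well-known checkpoints as examples.

```python
# Parameter-count sketch, assuming the Hugging Face `transformers` library
# (downloads the checkpoints); model names are examples, not an exhaustive list.
from transformers import AutoModel

for name in ["bert-base-uncased", "gpt2", "gpt2-xl"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: ~{n_params / 1e6:.0f}M parameters")
```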

Types of Transformers

Transformer models can be grouped into three main categories based on their architecture:

  • Auto-encoding (or encoder-only) models, like BERT, that are great at sentence classification, named entity recognition, and extractive question answering.
  • Auto-regressive (or decoder-only) models, like GPT, that are great at text generation.
  • Sequence-to-sequence (or encoder-decoder) models, like BART and T5, that are suitable for summarization and translation (one example of each family is sketched after the figure below).
Image describing the three different transformer types — auto-encoding, auto-regressive and sequence-to-sequence models.
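
Here is a hedged sketch of the three families using the Hugging Face pipeline API; the checkpoints and prompts are illustrative choices, not the only options.

```python
# One example task per transformer family, assuming the Hugging Face
# `transformers` library; checkpoints and prompts are illustrative.
from transformers import pipeline

# Auto-encoding (encoder-only): fill in a masked token with BERT.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Transformers are a type of neural [MASK].")[0]["token_str"])

# Auto-regressive (decoder-only): continue a prompt with GPT-2.
generate = pipeline("text-generation", model="gpt2")
print(generate("Transformers are", max_new_tokens=10)[0]["generated_text"])

# Sequence-to-sequence (encoder-decoder): translate English to French with T5.
translate = pipeline("translation_en_to_fr", model="t5-small")
print(translate("Transformers are data agnostic.")[0]["translation_text"])
```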

Traditional Language Models vs. LLMs

Traditional language models were count-based, estimating the probability of the next word from word counts, and struggled to capture long-range dependencies in text. Neural language models based on RNNs and LSTMs were the first step toward solving this before transformers (a toy count-based model is sketched below).
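
To make the “count-based” idea concrete, here is a toy bigram model (a made-up example, not from the cheatsheet): the next word is predicted from counts of the single previous word, so any context further back is ignored.

```python
# Toy count-based (bigram) language model; the corpus is made up.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(word):
    """Probabilities for the next word, conditioned only on the previous word."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Everything before "the" is invisible to the model: that is the
# long-range-dependency limitation in miniature.
print(next_word_probs("the"))
```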
