Exploring Transformers with Hugging Face

Hugging Face’s Transformers Library

  • The goal of the Hugging Face Transformers library is to provide a single Python API through which any transformer model can be loaded, trained, fine-tuned and saved.
  • The Hugging Face Transformers library provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. It’s backed by the three most popular deep learning libraries – JAX, PyTorch and TensorFlow.

The pipeline() function

  • The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.
  • Some of the text-related tasks available in the pipeline() function are feature extraction, named entity recognition, sentiment analysis, summarization and text generation.
from transformers import pipeline

# Load a default sentiment-analysis pipeline (the model is downloaded automatically)
classifier = pipeline("sentiment-analysis")

classifier(
    [
        "I love using the Transformers library!",
        "This is the worst movie I have ever seen.",
    ]
)

Model Hub

  • A model is a general term that can mean either architecture or checkpoint. Architecture refers to the specific neural network configuration used, while checkpoints are the trained weights for a given architecture. For example, BERT is an architecture, while bert-base-uncased is a checkpoint.
  • The Hugging Face Hub is an online platform with over 350k models, 75k datasets, and 150k demo apps, all open source and publicly available, to which models can be uploaded and from which they can be downloaded.
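
As a minimal sketch of how the Hub fits into a workflow (the checkpoint bert-base-uncased and the repository name my-model-copy below are only illustrative), a checkpoint can be downloaded by name and, after authenticating with huggingface-cli login, uploaded to your own namespace with push_to_hub():

from transformers import AutoModel

# Downloads the bert-base-uncased checkpoint from the Hub (cached locally afterwards)
model = AutoModel.from_pretrained("bert-base-uncased")

# Uploads the model to your own account on the Hub (requires prior authentication)
model.push_to_hub("my-model-copy")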

Model Cards

  • Model cards are Markdown files that accompany transformer models; they provide handy information and are essential for discoverability, reproducibility, and sharing in AI research.
  • Model cards contain additional metadata about the model, including, but not limited to, the model weights, training parameters, training datasets, model performance evaluation results, and its intended uses and potential limitations.
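
Model cards can also be read programmatically. A minimal sketch using the huggingface_hub library (bert-base-uncased is only an example repository):

from huggingface_hub import ModelCard

# Download and parse the model card of a repository on the Hub
card = ModelCard.load("bert-base-uncased")

print(card.data)  # structured metadata: license, tags, datasets, ...
print(card.text)  # the free-form Markdown body of the card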

Tokenizers

  • Tokenization is an essential step in preprocessing text in Natural Language Processing (NLP) tasks. Tokenizers break down a stream of textual data into words, subwords, or symbols (like punctuation) that can then be converted into numbers or vectors to be processed by algorithms. 
  • A tokenizer in a transformer model is responsible for splitting the input text into tokens, mapping each token to an integer and adding additional inputs that may be useful to the model.
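
Both of these points can be seen in action with a minimal sketch; bert-base-uncased is used here only as an example checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Transformers are powerful!"
tokens = tokenizer.tokenize(text)              # split the text into (sub)word tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map each token to an integer
encoded = tokenizer(text)                      # adds special tokens and an attention mask

print(tokens)
print(ids)
print(encoded)  # {'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}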

The from_pretrained() method

The from_pretrained() method loads a pretrained transformer model from the Hub or from a local path (the companion save_pretrained() method saves one). The AutoTokenizer, AutoProcessor and AutoModel classes allow one to load tokenizers, processors and models, respectively, for any model architecture.

from transformers import AutoModel, AutoTokenizer

checkpoint = 'pretrained-model-you-want'               # checkpoint name on the Hub or a local path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # load the matching tokenizer
model = AutoModel.from_pretrained(checkpoint)          # load the model weights
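
To complete the picture, a minimal sketch of the companion save_pretrained() method (the directory name ./my-local-model is just a placeholder):

# Save the tokenizer and the model weights to a local directory
tokenizer.save_pretrained("./my-local-model")
model.save_pretrained("./my-local-model")

# Later, reload them from that directory instead of the Hub
model = AutoModel.from_pretrained("./my-local-model")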

Token selection strategies

Decoder models employ various strategies in next token generation, which can be adjusted by the user. These include n-gram penalties, which prevent token sequences of n length from repeating; sampling, which chooses the next token at random from among a collection of likely next tokens; and temperature, which adjusts the predictability of the randomly selected next token, with higher temperatures producing less predictable output.
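
These strategies map onto arguments of the generate() method. A minimal sketch using gpt2 as an illustrative checkpoint (the specific parameter values are arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    no_repeat_ngram_size=2,  # n-gram penalty: no 2-token sequence may repeat
    do_sample=True,          # sample among likely next tokens instead of taking the argmax
    top_k=50,                # restrict sampling to the 50 most likely next tokens
    temperature=0.8,         # below 1.0 is more predictable, above 1.0 less predictable
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))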

When decoder models simply select the most probable next token for their output, it’s called “greedy search.” When the model projects several tokens further into the potential output and chooses the most probable multi-token sequence from among several candidates, it’s known as “beam search.”
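
Both strategies are also exposed through generate(); reusing the model, tokenizer and inputs from the sketch above:

# Greedy search: always pick the single most probable next token (the default behavior)
greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Beam search: track the 5 most probable multi-token sequences and return the best one
beams = model.generate(**inputs, max_new_tokens=40, num_beams=5, early_stopping=True)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(beams[0], skip_special_tokens=True))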