Getting Started With Hugging Face

Codecademy Team
An introduction to one of the most popular new ML education and model-building resources.

What is Hugging Face?

Hugging Face is a new open-source platform for working with AI. It offers a range of resources that include repositories of ML models, datasets, demo apps, and tools that abstract away many of the details required in configuring and preparing these resources.

The ability to procure open-source LLMs represents a new stage of accessibility in the evolution of ML tools. There is now an ecosystem of services that gives anyone with some familiarity with ML the opportunity to evaluate and run high-performing models for their bespoke applications.

The crown jewel of the Hugging Face platform is the Models repo, containing thousands of models of varying sizes and applications that are free to use. Included in this archive are high-profile performant models such as GPT-2 and BERT. There are models trained for specific tasks, such as text generation or translation, ready to go out of the box. Optionally, for a more tailored application, a model can be fine-tuned with custom data to meet a given use case.

Hugging Face Models Repo

Complementing Models is the Datasets repo, currently home to almost 100k unique datasets. It has training data available for a range of tasks for different classes of ML including multimodal applications, computer vision, NLP, and audio.

In addition to these repositories, Hugging Face has developed libraries that simplify and condense much of the work required in putting a model into action. The transformers library, for example, provides methods for tokenizing data that automatically account for how the selected model was trained.

Finally, the Hub is a repository for saving and sharing models. These can be copies (forks) of existing models, fine-tuned models, or completely new models built from the ground up. The Hub is an approachable option for managing and sharing model variants.

In this tutorial, we’ll walk through a generic workflow for running an open-source model. It will entail downloading a model, training it with data from one of the many datasets on Hugging Face, saving the trained model to the Hub repo, and finally calling on the model to make a prediction.

The workflow will leverage many of the handy methods provided by Hugging Face to quickly prep and instantiate a model. And the whole effort will be executed in the browser within a Google Colab notebook.

Getting Started

Hugging Face is free to use and creating an account only requires an email address. In many ways, the platform is analogous to GitHub in its function as well as its approach: all the main features are free and open to the public without limits. Anyone can create and upload as many models as they want at no additional cost.

The workflow shown in this tutorial saves the trained model to the Hub repo. The only additional (account) configuration necessary is the creation of a key that will provide access to a user profile from the notebook environment.

Keys can be managed under the profile/settings page.

Note: The key must be “write” enabled; otherwise, an error will be thrown.

Hugging Face Account Settings

The Colab Notebook: Kickoff

This project will be completed entirely within a Google Colab notebook, an online coding environment similar to Jupyter. It provides compute resources, libraries, and data in the browser, so there is no need to download these requirements to a local machine or set up a dedicated VM. Additionally, Colab has been adopted by Hugging Face as part of their standard documentation. Many of the guides on Hugging Face are accompanied by prepackaged, ready-to-run Colab notebooks configured for either PyTorch or TensorFlow.

Hugging Face Tutorial For Q and A

The first step is to import all the required libraries. Colab notebooks have many popular libraries preloaded, but often you may need to begin by installing packages via pip. The first cell will make the following call:

Hugging Face Libraries

In this cell, we call four different Hugging Face libraries:

  • Transformers: Methods for preparing models and data, as well as accessing APIs.
  • Datasets: Provides tools for creating and accessing datasets.
  • Evaluate: Provides a range of metrics for monitoring and assessing the training process.
  • Accelerate: Supports efficient processing in the model training phase.
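
In Colab, an install cell for these libraries would typically be a single line such as !pip install transformers datasets evaluate accelerate. As a rough sketch, the same step can be checked from Python; the helper names missing_packages and pip_command are illustrative and not part of any Hugging Face library.

```python
import importlib.util

# Import names of the four libraries listed above (the pip package names match).
REQUIRED = ["transformers", "datasets", "evaluate", "accelerate"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_command(names):
    """Build the install command for the given packages."""
    return "pip install " + " ".join(names)


print("Missing:", missing_packages(REQUIRED))
print("Install with:", pip_command(REQUIRED))
```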

With these libraries installed we can move on to loading the data, an archive of over 8k training examples from the Rotten Tomatoes film and TV review site.

Rotten Tomatoes Dataset

The rt variable in this example is a dictionary that contains the predefined splits as seen in the following cell:

Dataset Key Names

We can then review a given instance by calling a key and index:

Data Instance

Based on this excerpt, we can see that the dataset is composed of objects that have the text of the review, and a label (0 for negative and 1 for positive).
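
The loading and inspection steps above might look roughly like the following. This is a sketch under assumptions: the Hub dataset ID "rotten_tomatoes" and the helper names are not given in the text, and the import is deferred so the definitions load even before the datasets package is installed.

```python
def load_rotten_tomatoes():
    """Download the dataset from the Hub (needs the datasets package and network access)."""
    from datasets import load_dataset  # deferred import
    return load_dataset("rotten_tomatoes")  # assumed dataset ID


def split_sizes(dataset_dict):
    """Map each split name to its number of examples."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Typical notebook usage (not executed here):
#   rt = load_rotten_tomatoes()
#   split_sizes(rt)     # one entry per predefined split
#   rt["train"][0]      # a dict with "text" and "label" keys

# Offline stand-in with the same shape, for illustration only.
toy = {
    "train": [{"text": "a gripping, funny film", "label": 1}],
    "test": [{"text": "dull and overlong", "label": 0}],
}
print(split_sizes(toy))
```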

The only additional prerequisite we’ll address here is connecting to Hugging Face with the access token set up for “writing” to the account. The notebook cell will be set as follows:

Hugging Face Notebook Login Dialogue

The notebook_login() method call will return the token submission field seen above. Once completed, the notebook will be connected, so that the model repo can be created and populated upon initiation of model training. In general, connecting in this form is not required for accessing public datasets or models from Hugging Face. But to commit to the Hub repo or access private models within your profile a login is required.
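
A minimal sketch of the login cell, assuming notebook_login is imported from the huggingface_hub package; the wrapper name connect_to_hub is illustrative.

```python
def connect_to_hub():
    """Show the token prompt in the notebook and store the submitted token."""
    from huggingface_hub import notebook_login  # deferred: needs the huggingface_hub package
    notebook_login()


# In a notebook cell, call connect_to_hub() and paste a write-enabled token.
```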

The Colab Notebook: Prepping the Data

Working with text means employing tokenizers. Tokenizers process text by segmenting the sequences into tokens: atomic parts that are converted to numerical IDs, and eventually a representative tensor. Models can use any number of different tokenization systems. Some are character-based, while others separate text into larger elements such as whole words. Fine-tuning a model requires the same tokenization system that was applied in the original training. This is where the transformers library helps: it provides high-level, concise abstractions that save us from having to retrieve and specify these details for the selected model.
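
To make the idea concrete, here is a toy illustration of character-level and word-level tokenization followed by the conversion to IDs. It is deliberately simplistic and is not how the tokenizers in the transformers library work internally; all function names are made up for this sketch.

```python
def char_tokenize(text):
    """Character-level tokenization: every character is a token."""
    return list(text)


def word_tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Convert tokens to their numerical IDs."""
    return [vocab[token] for token in tokens]


review = "a fun fun film"
words = word_tokenize(review)
vocab = build_vocab(words)
print(words)                       # ['a', 'fun', 'fun', 'film']
print(encode(words, vocab))        # [0, 1, 1, 2]
print(len(char_tokenize(review)))  # 14
```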

All the models in the repository include implementation information in a “model card”, which includes details regarding how to load the model, the data that was used in training, potential biases, etc.

The Deberta Model Info

The card shown above is for the “deberta-v3-base” model, the model of choice for this example. But many others can be substituted by simply exchanging the model call. Models with similar features can be searched in the Models repo through the selection of tags seen across the top of the card such as “English” for the language of the model, the applicable framework, etc.

The transformers library provides us with the AutoTokenizer class, which automatically selects the appropriate tokenizer for our model, as seen in the following cell:

Tokenizer Import

Note: To fine-tune the “deberta-v3-base” model, one additional installation is required: pip install sentencepiece. This is the tokenizer that was used in the original training. Other models may not require an additional installation. If a required library is not present, an error message will be raised.

Next, we define the tokenization function and apply it to the dataset to create a new “tokenized” set. The function reads the field that holds the review text, so its form depends on how the dataset is constructed. It can also be configured with several options, and in this case, we specify that the elements should be truncated. Truncation ensures that input sequences do not exceed the maximum length of the model.

Tokenizer Function Cell

The function is then applied with the map() method from the datasets library.

Tokenizer Mapping Cell
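
The tokenizer import, the tokenization function, and the map() call could be sketched as follows. The Hub ID "microsoft/deberta-v3-base", the batched=True option, and the helper names are assumptions; a stand-in tokenizer is included only so the wiring can be tried without downloading anything.

```python
def load_tokenizer(checkpoint="microsoft/deberta-v3-base"):
    """Fetch the tokenizer matching the checkpoint (needs transformers, sentencepiece, network)."""
    from transformers import AutoTokenizer  # deferred import
    return AutoTokenizer.from_pretrained(checkpoint)


def make_preprocess(tokenizer):
    """Build the function handed to map(); it reads the "text" field of each batch."""
    def preprocess(examples):
        # truncation=True cuts sequences that exceed the model's maximum length
        return tokenizer(examples["text"], truncation=True)
    return preprocess


# Typical notebook usage (not executed here):
#   tokenizer = load_tokenizer()
#   tokenized_rt = rt.map(make_preprocess(tokenizer), batched=True)

# Stand-in so the wiring can be tried offline; it is NOT a real tokenizer.
def fake_tokenizer(texts, truncation=False):
    return {"input_ids": [[len(word) for word in text.split()] for text in texts]}


batch = {"text": ["great film", "no"]}
print(make_preprocess(fake_tokenizer)(batch))  # {'input_ids': [[5, 4], [2]]}
```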

If we call up the same sample on the processed dataset, we can see the changes made as a result of the tokenization. Now, each instance contains three additional fields that hold the token IDs, as well as values for the token type and attention mask attributes respectively.

Tokenizer Output Cell

The Colab Notebook: Prepping the Model

For this basic training effort, we will employ a generic metric. Some evaluation metrics are tailored to specific datasets, others to specific tasks (e.g. named entity recognition), while others are more general. In this case, we’re using accuracy, which is the fraction of correct results relative to the total examples evaluated.

Evaluation Library Import

Next, we must define a method for returning our metrics. The definition has just three lines:

  1. The eval_pred parameter is destructured as the predictions and labels variables.
  2. The predictions variable is then reassigned to the index of the maximum value along the given axis, which is the predicted class for each example.
  3. The last line returns the accuracy.compute() call with the relevant values for each parameter.

Metrics Function Definition
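
The three steps can be sketched in plain Python as below. This is a dependency-free stand-in: the tutorial's version relies on the accuracy metric's compute() call from the Evaluate library, and with NumPy step 2 would usually be written as an argmax over axis 1. The sample scores are made up.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def compute_metrics(eval_pred):
    # 1. Unpack the model outputs (one row of scores per example) and the true labels.
    predictions, labels = eval_pred
    # 2. Replace each row of scores with the index of its highest score.
    predictions = [argmax(row) for row in predictions]
    # 3. Accuracy: the fraction of predictions that match the labels.
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


scores = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.7], [0.6, 0.4]]
labels = [1, 0, 0, 0]
print(compute_metrics((scores, labels)))  # {'accuracy': 0.75}
```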

In addition to the metrics, the labels must be specified, as in the following cell:

Data Labels Declaration

The last import we have brings in our model and a pair of methods for our training run.

Model Import Cell
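
A sketch of the label declaration and model loading, under assumptions: the label names NEGATIVE and POSITIVE, the Hub ID "microsoft/deberta-v3-base", and the helper name load_classifier are illustrative, since the original cells are not reproduced in the text.

```python
# Label names are an assumption; the dataset only defines 0 (negative) and 1 (positive).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {name: idx for idx, name in id2label.items()}


def load_classifier(checkpoint="microsoft/deberta-v3-base"):
    """Download the base model with a fresh two-class classification head."""
    from transformers import AutoModelForSequenceClassification  # deferred import
    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )


print(label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}
```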

All the inputs are now present. The last definition will configure the training. It will consist of a basic set of parameters (e.g. batch size, etc.) as well as the data and metrics we’ve specified.

The Colab Notebook: The Model Training

The model training will use the Hugging Face Trainer class. We create an instance and pass a minimal set of parameters before calling trainer.train() to start the run. Most of the values passed here represent common defaults. The parameters for the training arguments include:

  • output_dir: The directory where checkpoints are written, which also serves as the name of the Hub repo.
  • learning_rate: The initial learning rate used by the optimization function.
  • per_device_train_batch_size (and per_device_eval_batch_size): The batch size for each device (GPU, TPU core, or CPU) during training and evaluation.
  • num_train_epochs: The number of full passes over the training data.
  • weight_decay: The weight decay, a form of regularization, applied to the model's weights.
  • evaluation_strategy (and save_strategy): How frequently the model is evaluated and saved (these arguments must match when load_best_model_at_end is enabled).
  • load_best_model_at_end: A boolean that, when set, ensures the best-performing checkpoint is loaded once training finishes, so that it is uploaded instead of the final iteration.
  • push_to_hub: A boolean that determines if the model is pushed to the Hub repository on every save.

The Trainer object that is called to perform the training is just the aggregation of all the elements configured to this point. The parameters include the downloaded model, the training arguments, the prepped training and testing datasets, and the evaluation metrics.

Model Training Definition
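
The training definition might be assembled along these lines. The hyperparameter values, the output directory name, and the use of DataCollatorWithPadding for batching are assumptions, not the tutorial's confirmed settings. Newer transformers releases appear to name the evaluation argument eval_strategy, so the sketch renames it when that parameter is present.

```python
import inspect

# Illustrative values only; the tutorial's exact settings are not shown in the text.
TRAINING_KWARGS = {
    "output_dir": "my_rt_model",      # hypothetical directory / repo name
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 2,
    "weight_decay": 0.01,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "push_to_hub": True,
}


def strategies_match(kwargs):
    """Check that the evaluation and save schedules agree."""
    return kwargs["evaluation_strategy"] == kwargs["save_strategy"]


def build_trainer(model, tokenized, tokenizer, compute_metrics, kwargs=TRAINING_KWARGS):
    """Assemble the Trainer from the pieces prepared in earlier cells."""
    from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
    args = dict(kwargs)
    # Use the newer argument name when the installed release exposes it.
    if "eval_strategy" in inspect.signature(TrainingArguments.__init__).parameters:
        args["eval_strategy"] = args.pop("evaluation_strategy")
    return Trainer(
        model=model,
        args=TrainingArguments(**args),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # pads each batch
        compute_metrics=compute_metrics,
    )


# Typical notebook usage (not executed here):
#   trainer = build_trainer(model, tokenized_rt, tokenizer, compute_metrics)
#   trainer.train()

print(strategies_match(TRAINING_KWARGS))  # True
```

The Trainer can also be given the tokenizer itself so that it is saved alongside the model. The name of that argument differs between releases, so it is left out of this sketch.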

It is important to note that the training may take considerable time to process. All the cells before the training will require minimal processing time; however, the training itself can easily take an hour or more. Alternative runtimes, such as a T4 GPU or TPU, can be selected within Colab, but availability is not guaranteed.

Once the training is underway, there will be an output that continuously updates its status. It will show the progress of the training as shown below:

Model Training Status Image

Note: Colab notebooks will disconnect if there is inactivity for 90 minutes. Therefore, it’s important to check in on the status of the training periodically or set up a script to keep the notebook active.

Once the training is initiated the new model repo will be available to view. Under the profile view, the new repo will appear under the Model heading with the given directory name. By selecting the model, we can bring up the repo as seen below:

New Model Repo View

The Colab Notebook: Returning Predictions

Once the model training is completed, we have a new model that can be used to label or predict values for new data. We can write an original text, akin to the reviews we used for training, and ask the model for a prediction. This is also known as running inference.

Mock Film Review Text

We can use the pipeline function to easily submit (process) the example and return a prediction.

Prediction Output

At the bottom of the output, we can see the label returned and the associated prediction score.
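
A sketch of the inference step, assuming the "text-classification" pipeline task and a hypothetical repo name; the review text and the score in the sample output are made up for illustration.

```python
def classify(text, model_id):
    """Run inference with a fine-tuned model from the Hub (needs transformers and network access)."""
    from transformers import pipeline  # deferred import
    classifier = pipeline("text-classification", model=model_id)
    return classifier(text)


def format_prediction(result):
    """Turn the pipeline output (a list with one dict per input) into a readable line."""
    top = result[0]
    return f"{top['label']} ({top['score']:.3f})"


# Typical notebook usage (not executed here); the repo name is hypothetical:
#   result = classify("A heartfelt story with a clumsy final act.", "your-username/my_rt_model")

# Example of the output shape, with made-up numbers.
sample = [{"label": "POSITIVE", "score": 0.9731}]
print(format_prediction(sample))  # POSITIVE (0.973)
```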

Now, moving forward, this model can be accessed at any time to run inference by calling the model directly (as in the cell above) without any additional prep work.

Conclusion

This tutorial has been a quick introduction to some of the resources and tools available through the Hugging Face platform. With a little bit of experience and some review of the documentation, anyone can take advantage of the open-source models available on Hugging Face to create a custom model. This tutorial focused on leveraging an LLM for text classification, but many of the general steps and methods can be applied with only minor changes to develop a wholly different model for a significantly different purpose.

To learn more about AI and the many libraries now available for integrating it into a workflow or app, check out our AI Articles.