When finetuning, examples need to be passed to the model at a uniform length so they can be processed in parallel. Hugging Face tokenizers accept padding and truncation arguments to enforce this. Padding is often determined by the longest sequence in the batch, while truncation automatically cuts a sequence off at the model's maximum input length.
tokenized_text = tokenizer(text, padding="longest", truncation=True)
To tokenize a Hugging Face dataset, use the dataset instance's .map method, passing in a function that receives a batch of examples and returns their tokenized form. A second, named parameter of batched=True ensures the data is tokenized in batches.
def tokenize_function(examples):
    # assumes the dataset stores its raw text in a column named "text"
    return tokenizer(examples["text"], padding="longest", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
Moving models and data to the GPU in PyTorch requires calling the .to() method and passing in torch.device("cuda").
# device-agnostic code:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
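As a minimal sketch, assuming the model and tokenizer from earlier in this section, both the model and each batch of inputs are moved to the selected device before the forward pass:

# move the model's parameters to the device
model = model.to(device)

# tokenize some text and move the resulting tensors to the same device
inputs = tokenizer("example text", return_tensors="pt").to(device)
outputs = model(**inputs)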
When your hyperparameters, model, and data are configured, you can finetune with the Hugging Face Trainer API by calling trainer.train(). Afterward, you can call trainer.evaluate() to gauge the finetuned model's performance against test data.
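As a minimal sketch, assuming a base model, a tokenized dataset with hypothetical "train" and "test" splits, and the training_args configured as shown at the end of this section:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)
trainer.train()               # run the finetuning loop
metrics = trainer.evaluate()  # returns a dict of metrics such as eval_loss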
Finetuning runs can be performed with low-rank adaptation (LoRA) via Hugging Face's peft library. Pass hyperparameters to LoraConfig(), then pass that config to get_peft_model() along with the base model. A good starting point for the alpha hyperparameter is double the value of the rank.
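A minimal LoRA sketch with peft, assuming a causal language model named model; the specific rank and dropout values here are illustrative:

from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # alpha set to double the rank, per the rule of thumb above
    lora_dropout=0.05,
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the LoRA adapters are trainable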
One popular library for quantizing large language models is bitsandbytes, which can quantize models to lower precisions such as 8-bit and 4-bit, shrinking them for use on consumer hardware.
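A minimal 4-bit loading sketch via the bitsandbytes integration in transformers; the checkpoint name is a placeholder:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # normalized 4-bit float quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "your-model-checkpoint",             # placeholder checkpoint name
    quantization_config=bnb_config,
)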
Perplexity (PPL) is a popular evaluation metric for generative language models, defined as the exponentiated average cross-entropy (per-token negative log-likelihood) of a sequence. It's a good way to gauge how effective a model is at predicting some target text.
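A minimal sketch of measuring perplexity on a piece of target text, assuming a causal language model and its tokenizer:

import torch

inputs = tokenizer("The text to score.", return_tensors="pt")
with torch.no_grad():
    # passing labels makes the model return the mean cross-entropy loss over the sequence
    outputs = model(**inputs, labels=inputs["input_ids"])
perplexity = torch.exp(outputs.loss)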
Finetuning hyperparameters are configured via the TrainingArguments class in the transformers library, where epochs, learning rate, and other common training hyperparameters can be set.
training_args = TrainingArguments(
    output_dir="./temp_results",
    num_train_epochs=3,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=1e-4,
)