Text Classification with PyTorch

Tokenization

Tokenization is the process of breaking down a text into individual units called tokens.

Tokenization strategies include:

  • word-based tokenization breaks a text into individual words
  • subword-based tokenization breaks words into smaller subword units
  • character-based tokenization breaks words into individual characters
text = '''Vanity and pride are different things'''
# word-based tokenization
words = ['Vanity', 'and', 'pride', 'are', 'different', 'things']
# subword-based tokenization
subwords = ['Van', 'ity', 'and', 'pri', 'de', 'are', 'differ', 'ent', 'thing', 's']
# character-based tokenization
characters = ['V', 'a', 'n', 'i', 't', 'y', ' ', 'a', 'n', 'd', ' ', 'p', 'r', 'i', 'd', 'e', ' ', 'a', 'r', 'e', ' ', 'd', 'i', 'f', 'f', 'e', 'r', 'e', 'n', 't', ' ', 't', 'h', 'i', 'n', 'g', 's']
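
Word-based and character-based tokens can be produced with plain Python; subword tokenization usually relies on a trained tokenizer (shown later with Hugging Face). A minimal sketch of the first and last strategies:

# Word-based tokens by splitting on whitespace
words = text.split()
# Character-based tokens by converting the string to a list
characters = list(text)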

Handling Out-of-Vocabulary Tokens

Out-of-vocabulary (OOV) tokens are words that are not in a model’s vocabulary. They are replaced with a default token ID, often the ID of <unk>, so that every word in a text can still be encoded during processing.

# Vocabulary dictionary
vocab = {
    'the': 0,
    'future': 2,
    'belongs': 3,
    'to': 4,
    'those': 5,
    'who': 6,
    'believe': 7,
    'in': 8,
    'dreams': 9,
    '<unk>': 1
}
# Sample sentence
sentence = "The future belongs to those who innovate."
# Tokenize the sentence, mapping OOV words to the <unk> ID
tokenized_id_sentence = [vocab.get(word, vocab['<unk>']) for word in sentence.split()]
# Output the token IDs ('The' and 'innovate.' are OOV, so both map to 1)
print(tokenized_id_sentence)
# Output: [1, 2, 3, 4, 5, 6, 1]

Subword Tokenization

Subword tokenization breaks words into smaller, known subwords, which helps handle out-of-vocabulary terms and improves model predictions in natural language processing tasks.

# Example of subword tokenization with Hugging Face Tokenizers
from transformers import BertTokenizer
# Initialize tokenizer with pre-trained model
model_name = 'huawei-noah/TinyBERT_General_4L_312D'
tinybert_tokenizer = BertTokenizer.from_pretrained(model_name)
# Encoding a sentence
sentence = "Subword tokenization is useful."
subwords = tinybert_tokenizer.tokenize(sentence)
# Display tokens
print(subwords)
# Output: ['sub', '##word', 'token', '##ization', 'is', 'useful', '.']
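
To feed these tokens to a model, they must be mapped to vocabulary IDs. As a sketch, the same tokenizer can do this either from the token list or directly from the sentence (the direct call also adds special tokens such as [CLS] and [SEP]):

# Map the subword tokens to their vocabulary IDs
subword_ids = tinybert_tokenizer.convert_tokens_to_ids(subwords)
# Or encode the sentence directly, including special tokens
encoded = tinybert_tokenizer(sentence)
print(encoded['input_ids'])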

Tokenization with Unknown Tokens

In language modeling, out-of-vocabulary words are assigned a special token, such as <unk>, during tokenization. This helps manage unknown inputs by replacing them with a known, default value, ensuring model consistency.

# Define the sentence and vocabulary
tokenize = lambda x: x.split() # Simple tokenizer splitting by spaces
sentence = "The future belongs to those who shape it."
vocab = {'<unk>': 1, 'The': 3, 'future': 142, 'belongs': 125, 'to': 203, 'shape': 88}
# Tokenization process with handling of unknown words
tokenized_id_sentence = [vocab.get(word, vocab['<unk>']) for word in tokenize(sentence)]
print(tokenized_id_sentence)
# Output: [3, 142, 125, 203, 1, 1, 88, 1]

Python Text Padding

Padding ensures that text sequences are of equal length by adding tokens such as <pad> to shorter sequences. This is crucial for tasks like batching data to be used as input to a model.

# Initial text sequence with 3 tokens
text = ["I", "enjoy", "coding"]
text_id = [13, 444, 0]
max_len = 5
# Pad the sequence with <pad> token IDs (here 1) if necessary
padded_text_id = text_id
if len(text_id) < max_len:
    padded_text_id = text_id + [1] * (max_len - len(text_id))
print(padded_text_id)
# Output: [13, 444, 0, 1, 1]
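
When batching in PyTorch, padding is often handled with torch.nn.utils.rnn.pad_sequence, which pads every sequence in a batch to the length of the longest one. A minimal sketch, assuming a <pad> ID of 1:

import torch
from torch.nn.utils.rnn import pad_sequence
# Two token-ID sequences of different lengths
seq_a = torch.tensor([13, 444, 0])
seq_b = torch.tensor([13, 444, 0, 87, 5])
# Pad to the longest sequence in the batch
batch = pad_sequence([seq_a, seq_b], batch_first=True, padding_value=1)
# batch is a 2x5 tensor: [[13, 444, 0, 1, 1], [13, 444, 0, 87, 5]]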

Text Truncation

Truncation shortens longer sequences to fit a maximum length, which ensures that all text sequences in a batch have the same length. You can truncate a sequence from the beginning, the end, or both.

# Specify a maximum sequence length
max_len = 5
# Text sequence with 7 tokens truncated to 5 tokens
text = ["Good", "programmers", "write", "code", "humans", "can", "understand"]
text_id = [52, 0, 0, 0, 458, 12, 337]
# Truncate from the end (keep the first max_len tokens)
truncated_text_id = text_id
if len(text_id) > max_len:
    truncated_text_id = text_id[:max_len]
print(truncated_text_id)
# Output: [52, 0, 0, 0, 458]
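
To truncate from the beginning instead, keep the last max_len tokens; a minimal sketch:

# Truncate from the beginning (keep the last max_len tokens)
truncated_from_start = text_id[-max_len:]
print(truncated_from_start)
# Output: [0, 0, 458, 12, 337]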

Fine-tuning Models

Fine-tuning a pre-trained model is a crucial step in adapting it to specific tasks. This involves freezing certain layers in the model to maintain learned features while updating others with task-specific data, enhancing model performance without losing general knowledge.

The general steps of fine-tuning involve:

  1. freezing the weights in certain specified layers
  2. unfreezing the weights in other specified layers
  3. updating the weights in the unfrozen layers with task-specific data

Gradually unfreezing additional layers may help improve model performance.

# Freezing and Unfreezing layers in a BERT model
# Freeze pre-trained layers
for param in pretrained_bert.bert.parameters():
    param.requires_grad = False
# Unfreeze classifier layer
for param in pretrained_bert.classifier.parameters():
    param.requires_grad = True
# Unfreeze encoder layer
for param in pretrained_bert.bert.encoder.layer[3].parameters():
    param.requires_grad = True
# Note: This setup allows us to fine-tune our model on specific tasks
# while preserving the general language understanding.
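
The final step, updating the unfrozen weights with task-specific data, is a regular PyTorch training loop; only parameters with requires_grad=True receive gradient updates. A minimal sketch, assuming a DataLoader named train_loader that yields tokenized batches with labels:

import torch
from torch.optim import AdamW
# Optimize only the unfrozen parameters
optimizer = AdamW(
    (p for p in pretrained_bert.parameters() if p.requires_grad), lr=2e-5
)
pretrained_bert.train()
for batch in train_loader:
    optimizer.zero_grad()
    outputs = pretrained_bert(
        input_ids=batch['input_ids'],
        attention_mask=batch['attention_mask'],
        labels=batch['labels']
    )
    # The model returns a loss when labels are provided
    outputs.loss.backward()
    optimizer.step()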

BERT Transformer Model

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer model. It excels at interpreting a token’s meaning by considering its context based on surrounding tokens, looking at both directions—left and right. This bidirectional attention allows BERT to understand nuanced meanings of words and phrases within a sequence.

# Load a Pre-trained BERT
from transformers import BertTokenizer, BertForSequenceClassification
model_name = 'bert-base-uncased'
bert_tokenizer = BertTokenizer.from_pretrained(model_name)
pretrained_bert = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
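
As a sketch of how the tokenizer and model fit together, a sentence can be encoded and passed through the model to obtain class logits (the example sentence and label interpretation are assumptions):

import torch
# Encode a sentence into input IDs and an attention mask
inputs = bert_tokenizer("This movie was great!", return_tensors='pt')
# Forward pass without gradient tracking
with torch.no_grad():
    outputs = pretrained_bert(**inputs)
# Logits have shape (batch_size, num_labels); pick the highest-scoring class
predicted_class = outputs.logits.argmax(dim=-1).item()
print(predicted_class)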

Self-Attention Mechanism

The self-attention mechanism is a crucial component of transformers that allows them to focus on different parts of an input sequence. This helps capture contextual relationships more effectively than earlier models such as RNNs or LSTMs, especially in longer sequences. Self-attention also enables transformers to process sequences in parallel, improving both performance and understanding of the sequence's meaning.
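
As a minimal sketch of the idea (not the full multi-head version used in BERT), scaled dot-product self-attention projects the input into queries, keys, and values, and weights each token's value by its similarity to every other token:

import torch
import torch.nn.functional as F
# Toy input: a batch of 1 sequence with 4 tokens, each an 8-dimensional embedding
x = torch.randn(1, 4, 8)
# Linear projections for queries, keys, and values
W_q, W_k, W_v = (torch.nn.Linear(8, 8) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)
# Attention scores: scaled similarity of every token to every other token
scores = Q @ K.transpose(-2, -1) / (8 ** 0.5)
# Softmax converts scores into attention weights that sum to 1 for each token
weights = F.softmax(scores, dim=-1)
# Each output token is a weighted mix of all the value vectors
output = weights @ V
print(output.shape)  # torch.Size([1, 4, 8])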

Classification Metrics - Precision, Recall, and F1 Score

Evaluation metrics other than accuracy include precision, recall, and F1-score.

Precision pays attention to false positives whereas recall pays attention to false negatives:

\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}} \hspace{1cm} \text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}

The F1 Score is the harmonic mean of precision and recall that balances concerns for false positives and false negatives:

\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

The classification report summarizes the precision, recall, and F1 score for each class. When there are more than two classes, it also reports averages:

  • the macro average gives equal weight to each class
  • the micro average aggregates over all observations, so classes with a larger number of observations (support) have more influence
from sklearn.metrics import classification_report
# true_labels and predicted_labels are lists of class labels
report = classification_report(true_labels, predicted_labels)
print(report)
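
Precision, recall, and F1 are also available as standalone functions; a small sketch with hypothetical binary labels:

from sklearn.metrics import precision_score, recall_score, f1_score
# Hypothetical true and predicted binary labels
true_labels = [1, 0, 1, 1, 0, 1]
predicted_labels = [1, 0, 0, 1, 1, 1]
print(precision_score(true_labels, predicted_labels))  # 3 TP / (3 TP + 1 FP) = 0.75
print(recall_score(true_labels, predicted_labels))     # 3 TP / (3 TP + 1 FN) = 0.75
print(f1_score(true_labels, predicted_labels))         # harmonic mean of 0.75 and 0.75 = 0.75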
