Tokenization is the process of breaking down a text into individual units called tokens.
Tokenization strategies include word-based, subword-based, and character-based tokenization:
text = '''Vanity and pride are different things'''

# word-based tokenization
words = ['Vanity', 'and', 'pride', 'are', 'different', 'things']

# subword-based tokenization
subwords = ['Van', 'ity', 'and', 'pri', 'de', 'are', 'differ', 'ent', 'thing', 's']

# character-based tokenization
characters = ['V', 'a', 'n', 'i', 't', 'y', ' ', 'a', 'n', 'd', ' ', 'p', 'r', 'i', 'd', 'e', ' ', 'a', 'r', 'e', ' ', 'd', 'i', 'f', 'f', 'e', 'r', 'e', 'n', 't', ' ', 't', 'h', 'i', 'n', 'g', 's']
Out-of-vocabulary (OOV) tokens are words not in a model’s vocabulary. They are replaced with a default token ID, often <unk>, so that every word in the input can still be encoded during processing.
# Vocabulary dictionary
vocab = {
    'the': 0, '<unk>': 1, 'future': 2, 'belongs': 3, 'to': 4,
    'those': 5, 'who': 6, 'believe': 7, 'in': 8, 'dreams': 9
}

# Sample sentence
tokenized_sentence = "The future belongs to those who innovate."

# Tokenize the sentence considering OOV words
tokenized_id_sentence = [vocab.get(word, vocab['<unk>']) for word in tokenized_sentence.split()]

# Output the tokenized sentence
# 'The' (capitalized) and 'innovate.' are not in the vocabulary, so both map to <unk> (1)
print(tokenized_id_sentence)
# Output: [1, 2, 3, 4, 5, 6, 1]
Subword tokenization breaks words into smaller, known subwords, which helps handle out-of-vocabulary terms without replacing them with <unk>. Preserving this information improves model predictions in natural language processing tasks.
# Example of subword tokenization with Hugging Face Tokenizers
from transformers import BertTokenizer

# Initialize tokenizer with pre-trained model
model_name = 'huawei-noah/TinyBERT_General_4L_312D'
tinybert_tokenizer = BertTokenizer.from_pretrained(model_name)

# Encoding a sentence
sentence = "Subword tokenization is useful."
subwords = tinybert_tokenizer.tokenize(sentence)

# Display tokens
print(subwords)
# Output: ['sub', '##word', 'token', '##ization', 'is', 'useful', '.']
Unknown Word Tokens
In language modeling, out-of-vocabulary words are assigned a special token, such as <unk>, during tokenization. This helps manage unknown inputs by replacing them with a known, default value, ensuring model consistency.
# Define the sentence and vocabulary
tokenize = lambda x: x.split()  # Simple tokenizer splitting by spaces
sentence = "The future belongs to those who shape it."
vocab = {'<unk>': 1, 'The': 3, 'future': 142, 'belongs': 125, 'to': 203, 'shape': 88}

# Tokenization process with handling of unknown words
tokenized_id_sentence = [vocab.get(word, vocab['<unk>']) for word in tokenize(sentence)]

print(tokenized_id_sentence)
# Output: [3, 142, 125, 203, 1, 1, 88, 1]
Padding ensures text sequences are equal length by adding tokens like <pad> to shorter sequences. This is crucial for tasks like batching data to feed into a model.
# Initial text sequence with 3 tokens
text = ["I", "enjoy", "coding"]
text_id = [13, 444, 0]
max_len = 5

# Pad the sequence with additional tokens if necessary
if len(text_id) < max_len:
    padded_text_id = text_id + [1] * (max_len - len(text_id))
    print(padded_text_id)
# Output: [13, 444, 0, 1, 1]
Truncating shortens sequences that exceed a maximum length so that all text sequences fit the same size. You can truncate a sequence from the beginning, the end, or both.
# Specify a maximum sequence length
max_len = 5

# Text sequence with 7 tokens truncated to 5 tokens
text = ["Good", "programmers", "write", "code", "humans", "can", "understand"]
text_id = [52, 0, 0, 0, 458, 12, 337]

# Truncate from the end, keeping the first max_len tokens
if len(text_id) > max_len:
    truncated_text = text[:max_len]
    truncated_text_id = text_id[:max_len]

print(truncated_text)
# Output: ['Good', 'programmers', 'write', 'code', 'humans']
print(truncated_text_id)
# Output: [52, 0, 0, 0, 458]
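The example above drops tokens from the end. A minimal sketch of truncating from the beginning instead, keeping the last max_len tokens of the same text_id list, would be:

# Truncate from the beginning, keeping the last max_len tokens
if len(text_id) > max_len:
    truncated_text_id = text_id[-max_len:]

print(truncated_text_id)
# Output: [0, 0, 458, 12, 337]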
Fine-tuning a pre-trained model is a crucial step in adapting it to specific tasks. This involves freezing certain layers in the model to maintain learned features while updating others with task-specific data, enhancing model performance without losing general knowledge.
The general steps of fine-tuning involve loading a pre-trained model, freezing its pre-trained layers, unfreezing or adding task-specific layers, and training on the task-specific dataset.
Gradually unfreezing additional layers may help improve model performance.
# Freezing and Unfreezing layers in a BERT model

# Freeze pre-trained layers
for param in pretrained_bert.bert.parameters():
    param.requires_grad = False

# Unfreeze classifier layer
for param in pretrained_bert.classifier.parameters():
    param.requires_grad = True

# Unfreeze encoder layer
for param in pretrained_bert.bert.encoder.layer[3].parameters():
    param.requires_grad = True

# Note: This setup allows us to fine-tune our model on specific tasks
# while preserving the general language understanding.
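To complete the picture, here is a minimal training-loop sketch. It assumes pretrained_bert and bert_tokenizer are the BertForSequenceClassification model and tokenizer loaded as in the BERT snippet below, and train_texts / train_labels are hypothetical placeholder data:

# Minimal fine-tuning loop sketch (train_texts and train_labels are hypothetical)
import torch
from torch.optim import AdamW

train_texts = ["great movie", "terrible plot"]   # placeholder examples
train_labels = torch.tensor([1, 0])              # placeholder labels

# Only parameters with requires_grad=True (the unfrozen layers) are updated
optimizer = AdamW([p for p in pretrained_bert.parameters() if p.requires_grad], lr=2e-5)

pretrained_bert.train()
for epoch in range(3):
    inputs = bert_tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
    outputs = pretrained_bert(**inputs, labels=train_labels)
    loss = outputs.loss                          # cross-entropy over the classifier logits
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")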
BERT Transformer Model
BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer model. It excels at interpreting a token’s meaning by considering its context based on surrounding tokens, looking in both directions, left and right. This bidirectional attention allows BERT to understand nuanced meanings of words and phrases within a sequence.
# Load a Pre-trained BERT
from transformers import BertTokenizer, BertForSequenceClassification

model_name = 'bert-base-uncased'
bert_tokenizer = BertTokenizer.from_pretrained(model_name)
pretrained_bert = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
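As a quick illustrative follow-up, the loaded tokenizer and model can be used together as shown below. The example sentence is arbitrary, and the freshly initialized classification head has not been fine-tuned, so the prediction is not meaningful yet:

# Tokenize a sentence and run it through the classification model
import torch

inputs = bert_tokenizer("I really enjoyed this book.", return_tensors="pt")
with torch.no_grad():
    logits = pretrained_bert(**inputs).logits    # shape: (1, num_labels)

predicted_class = torch.argmax(logits, dim=-1).item()
print(predicted_class)
# Output: 0 or 1 (not meaningful until the model is fine-tuned)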
The self-attention mechanism is a crucial component of transformers that allows them to focus on different parts of an input sequence. It captures contextual relationships more effectively than earlier models such as RNNs or LSTMs, especially in longer sequences. Self-attention also lets transformers process all tokens of a sequence in parallel, improving both speed and the model's understanding of the sequence's meaning.
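As a rough, self-contained sketch of the idea, the following computes scaled dot-product self-attention with NumPy on a toy sequence; the dimensions and the random embedding and projection matrices are made-up illustrations, not weights from any real model:

# Scaled dot-product self-attention on a toy sequence (illustrative values only)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 8                      # 4 tokens, 8-dimensional embeddings (toy sizes)
rng = np.random.default_rng(0)
X = rng.random((seq_len, d_model))           # token embeddings
W_q = rng.random((d_model, d_model))         # query/key/value projections (random here)
W_k = rng.random((d_model, d_model))
W_v = rng.random((d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)          # how strongly each token relates to every other token
weights = softmax(scores, axis=-1)           # attention weights sum to 1 for each token
output = weights @ V                         # each token becomes a weighted mix of the whole sequence

print(weights.shape, output.shape)
# Output: (4, 4) (4, 8)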
Evaluation metrics other than accuracy include precision, recall, and F1-score.
Precision pays attention to false positives, whereas recall pays attention to false negatives (TP = true positives, FP = false positives, FN = false negatives):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
The F1 score is the harmonic mean of precision and recall, balancing concerns for false positives and false negatives:

F1 = 2 * (Precision * Recall) / (Precision + Recall)
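For instance, all three metrics can be computed with scikit-learn; the label lists below are made-up values used only to illustrate the calls:

# Precision, recall, and F1 score with scikit-learn (illustrative labels)
from sklearn.metrics import precision_score, recall_score, f1_score

true_labels = [1, 0, 1, 1, 0, 1, 0, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(true_labels, predicted_labels))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(true_labels, predicted_labels))     # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(true_labels, predicted_labels))         # harmonic mean of the two = 0.75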
The classification report generates a summary of the precision, recall, and F1 score for each class, which is especially useful when there are more than two classes:
from sklearn.metrics import classification_report

report = classification_report(true_labels, predicted_labels)