Preparing a dataset for language model training involves three steps: preprocessing, which cleans and standardizes text; tokenization, which transforms text into numerical indices; and batching, which groups sequences to enhance training efficiency.
import torch
import re

# Step 1: Preprocessing - Clean and standardize text
text = "Hello, World! Welcome to Language Modeling."
text = re.sub(r'[^a-zA-Z\s,.:;?_&$!"()\-\*\[\]]', '', text.lower()).strip()

# Step 2: Tokenization - Convert text into numerical indices
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
encoded_text = [stoi[c] for c in text]

# Step 3: Batching - Group sequences for efficient training
batch_size, block_size = 4, 8
data = torch.tensor(encoded_text, dtype=torch.long)
batched_data = [data[i:i+block_size] for i in range(0, len(data), block_size)]
Encoding functions convert text into lists of indices for efficient tokenization, while decoding functions revert indices back to text. In a bigram model, this is done using stoi (string to index) and itos (index to string) dictionaries.
# Create mappings for encoding and decoding
chars = sorted(list(set("sample text for encoding")))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encoding function: Convert text to indices
encode = lambda s: [stoi[c] for c in s]

# Decoding function: Convert indices back to text
decode = lambda l: ''.join([itos[i] for i in l])

# Example usage
encoded = encode("sample")
decoded = decode(encoded)
print("Encoded:", encoded)
print("Decoded:", decoded)
The attention mechanism helps transformers identify important words in a sentence. It uses matrix multiplication and the softmax function to assign weights to words, combining their embeddings based on importance and context. This allows the model to understand relationships between words, even those far apart in the text.
import torch
import torch.nn.functional as F

B, T, C = 2, 5, 4  # Batch size, Sequence length, Embedding dimension
x = torch.randn(B, T, C)  # Random input embeddings

# Define query, key, and value transformations
query = x  # Using the input embeddings as queries
key = x    # Using the input embeddings as keys
value = x  # Using the input embeddings as values

# Step 1: Calculate attention scores using matrix multiplication
attention_scores = torch.matmul(query, key.transpose(-2, -1))  # Shape: (B, T, T)

# Step 2: Apply softmax to convert scores into attention weights
attention_weights = F.softmax(attention_scores, dim=-1)  # Shape: (B, T, T)

# Step 3: Use the attention weights to combine the value embeddings
output = torch.matmul(attention_weights, value)  # Shape: (B, T, C)

# Output the results
print("Attention Weights:\n", attention_weights)
print("Output Embeddings:\n", output)
Self-attention in transformers uses key, query, and value projections along with positional embeddings. This combination creates context-aware representations that capture both the relationships between tokens and the overall structure of the sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C = 2, 5, 4  # Batch size, Sequence length, Embedding dimension
x = torch.randn(B, T, C)  # Random input embeddings

# Positional embeddings: Simulating positional information added to x
positional_embeddings = torch.randn(B, T, C)
x = x + positional_embeddings  # Adding positional information to embeddings

# Linear projections produce the query, key, and value representations
query_proj = nn.Linear(C, C, bias=False)
key_proj = nn.Linear(C, C, bias=False)
value_proj = nn.Linear(C, C, bias=False)
query = query_proj(x)
key = key_proj(x)
value = value_proj(x)

# Step 1: Calculate scaled attention scores between every pair of tokens
scores = torch.matmul(query, key.transpose(-2, -1)) / torch.sqrt(torch.tensor(C, dtype=torch.float32))  # Shape: (B, T, T)

# Step 2: Apply softmax to get attention weights
attention_weights = F.softmax(scores, dim=-1)  # Shape: (B, T, T)

# Step 3: Compute the output by combining the value embeddings with attention weights
output = torch.matmul(attention_weights, value)  # Shape: (B, T, C)

# Output the results
print("Attention Weights:\n", attention_weights)
print("Self-Attention Output:\n", output)
Positional encodings provide information about the order of tokens in the input sequence, which is crucial because transformers do not process data sequentially. By adding positional encodings to token embeddings, the model can capture the structure and relationships within a sentence, aiding in tasks like predicting the next token in a bigram model.
import numpy as np

def generate_positional_embeddings(T, C):
    # Initialize positional embeddings with zeros
    positional_embeddings = np.zeros((T, C))
    # Calculate the embeddings using sine and cosine functions;
    # each even/odd pair of dimensions shares one frequency
    for t in range(T):
        for c in range(C):
            if c % 2 == 0:
                positional_embeddings[t, c] = np.sin(t / (10000 ** (c / C)))
            else:
                positional_embeddings[t, c] = np.cos(t / (10000 ** ((c - 1) / C)))
    return positional_embeddings

# Call the function with T=10 and C=4
positional_embeddings = generate_positional_embeddings(10, 4)
print("Positional Embeddings:\n", positional_embeddings)
Feedforward Layers
Feedforward layers in transformers are similar to those in standard neural networks. They apply linear transformations followed by non-linear activation functions, such as ReLU, to each token representation. In each transformer block this step comes after the attention layer, giving the model additional capacity to learn complex patterns from the attended representations.
import torch
import torch.nn.functional as F

B, C = 3, 8  # Batch size, Embedding dimension
x = torch.randn(B, C)  # Random input token representations

# Define a feedforward layer with a linear transformation and ReLU activation
linear_layer = torch.nn.Linear(C, C)
output = F.relu(linear_layer(x))

# Output the results
print("Input Token Representations:\n", x)
print("\nOutput After Feedforward Layer:\n", output)
Layer normalization stabilizes the training of transformer models by normalizing token embeddings. This process helps improve convergence and overall model performance by ensuring consistent distribution of input values to each layer.
import torch
import torch.nn as nn

B, T, C = 3, 5, 4  # Batch size, Sequence length, Embedding dimension
x = torch.randn(B, T, C)  # Random input token embeddings

# Apply layer normalization
layer_norm = nn.LayerNorm(C)
normalized_x = layer_norm(x)

# Output the results
print("Original Token Embeddings:\n", x)
print("\nNormalized Token Embeddings:\n", normalized_x)
Single-head attention computes one set of attention scores, capturing only one type of relationship at a time. Multi-head attention uses multiple attention heads to capture various relationships within the input sequence, allowing the model to understand complex dependencies more effectively.
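The sketch below illustrates this idea with two heads over the same toy shapes used above; it is a minimal example, and the layer names (query_proj, key_proj, value_proj, out_proj) are illustrative rather than taken from earlier code.

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, num_heads = 2, 5, 8, 2  # Batch size, Sequence length, Embedding dimension, Heads
head_dim = C // num_heads        # Each head attends over a slice of the embedding
x = torch.randn(B, T, C)         # Random input embeddings

# Separate projections for queries, keys, and values (illustrative layer names)
query_proj = nn.Linear(C, C, bias=False)
key_proj = nn.Linear(C, C, bias=False)
value_proj = nn.Linear(C, C, bias=False)
out_proj = nn.Linear(C, C, bias=False)

# Project, then split the embedding dimension into heads: (B, num_heads, T, head_dim)
q = query_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)
k = key_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)
v = value_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)

# Each head computes its own scaled attention weights: (B, num_heads, T, T)
scores = torch.matmul(q, k.transpose(-2, -1)) / (head_dim ** 0.5)
attention_weights = F.softmax(scores, dim=-1)

# Combine values per head, then merge the heads back into one embedding: (B, T, C)
per_head_output = torch.matmul(attention_weights, v)
combined = per_head_output.transpose(1, 2).contiguous().view(B, T, C)
output = out_proj(combined)

# Output the results
print("Per-Head Attention Weights shape:", attention_weights.shape)
print("Multi-Head Attention Output shape:", output.shape)

Splitting the embedding across heads keeps the total computation comparable to single-head attention while letting each head learn a different pattern of token-to-token relationships.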