## PyTorch Image Models

PyTorch provides comprehensive tools for vision tasks through libraries like torchvision:
```python
from torchvision import datasets, transforms, models

# Load a pre-built model for classification
# (newer torchvision versions use weights=models.ResNet50_Weights.DEFAULT instead)
resnet = models.resnet50(pretrained=True)

# Load a dataset
cifar10 = datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)
```
## PyTorch DataLoaders

DataLoaders in PyTorch are essential for managing image data. They efficiently handle batching, shuffling, and transformation during training, which is crucial for optimizing model performance and ensuring variability across training epochs.
```python
from torch.utils.data import DataLoader

# Create a DataLoader with batch size of 64
# Shuffle training data to prevent overfitting
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

# Usage in training loop
for images, labels in dataloader:
    # Each iteration loads a batch of 64 images
    outputs = model(images)
    loss = criterion(outputs, labels)
    # ...continue with backpropagation
```
Image transformations standardize data for model input. They are applied sequentially and should be identical for the training and testing sets (except for augmentations):
```python
from torchvision import datasets, transforms

# Create transformation pipeline
transform = transforms.Compose([
    transforms.Resize((64, 64)),                             # Resize to 64x64 pixels
    transforms.ToTensor(),                                   # Convert to tensor, scale to [0.0, 1.0]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))   # Normalize RGB channels
])

# Apply transformations when loading dataset
dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    transform=transform,
    download=True
)
```
## Augmentations

Image augmentations such as flipping, rotating, and color jittering create diverse variants of training images, improving model generalization and helping prevent overfitting. Augmentations are applied only to training data, not to testing/validation data, ensuring the vision model generalizes well to new data.
```python
from torchvision import transforms

# Training transforms with augmentations
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),        # 50% chance of flipping horizontally
    transforms.RandomRotation(15),            # Rotate ±15 degrees
    transforms.ColorJitter(brightness=0.2),   # Adjust brightness by ±20%
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Testing transforms without augmentations
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
```
Convolutional Neural Networks (CNNs) excel at image tasks through specialized convolutional and pooling layers, and they are the backbone of many vision applications such as image classification:
```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # Convolutional layer: 3 input channels, 12 filters, 3x3 kernel
        self.conv1 = nn.Conv2d(3, 12, kernel_size=3, padding=1)
        # Fully connected layers (assumes 32x32 input, pooled to 16x16)
        self.fc1 = nn.Linear(12 * 16 * 16, 64)
        self.fc2 = nn.Linear(64, 10)  # 10 output classes

    def forward(self, x):
        # Apply convolution and ReLU activation
        x = F.relu(self.conv1(x))
        # Apply max pooling (2x2)
        x = F.max_pool2d(x, 2)
        # Flatten for fully connected layer
        x = x.view(x.size(0), -1)
        # Pass through fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
```
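As a quick shape check, the model can be run on a random batch (assuming 32x32 RGB inputs such as CIFAR-10, which is what the 12 * 16 * 16 flattened size expects):

```python
import torch

model = SimpleCNN()
dummy = torch.randn(8, 3, 32, 32)  # batch of 8 CIFAR-10-sized images
logits = model(dummy)
print(logits.shape)  # torch.Size([8, 10])
```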
## Conv2d Basics

A convolutional layer is essential in Convolutional Neural Networks (CNNs). In PyTorch, you initialize it using nn.Conv2d, customizing the number of input channels, filters (output channels), kernel size, and padding to fit your network's needs.
```python
import torch
import torch.nn as nn

conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=(3, 3), padding=1)

# Example input with dimensions (batch_size=1, channels=3, height=32, width=32)
input_tensor = torch.randn(1, 3, 32, 32)

# Forward pass
output_tensor = conv_layer(input_tensor)
print(output_tensor.shape)  # Expected: [1, 16, 32, 32]
```
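The output spatial size follows from the input size, kernel size, padding, and stride: out = ⌊(in + 2·padding − kernel) / stride⌋ + 1. Here, (32 + 2·1 − 3) / 1 + 1 = 32, which is why padding=1 preserves the 32x32 resolution.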
## ViTs

Vision Transformers (ViTs) revolutionize how machines analyze visual data by adapting the attention mechanism from traditional transformers to images, and they are adept at tasks like image classification, object detection, and image segmentation. A ViT adapts the transformer architecture for images by splitting each image into fixed-size patches, linearly embedding the patches as tokens (with added position embeddings), and processing the resulting sequence with a standard transformer encoder, as sketched below.
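To make the patch step concrete, here is a minimal sketch of patch embedding using a strided convolution, assuming ViT-Base settings (224x224 input, 16x16 patches, hidden size 768); this is an illustration, not the exact Hugging Face implementation:

```python
import torch
import torch.nn as nn

# Patch embedding: a conv with kernel = stride = patch size
# turns a 224x224 image into a sequence of 14x14 = 196 patch tokens
patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)
tokens = patch_embed(image)                 # [1, 768, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2)  # [1, 196, 768]
print(tokens.shape)
```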
## Google ViT Base

Google's pre-trained ViT models can be easily loaded through Hugging Face. The model name "google/vit-base-patch16-224" indicates a base-size ViT with 16x16 pixel patches and a 224x224 input resolution. AutoImageProcessor handles image preprocessing, while AutoModelForImageClassification loads the model weights.

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load pre-trained ViT processor and model
model_name = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_name)
vit_model = AutoModelForImageClassification.from_pretrained(model_name)

# Process an image for the model
image = Image.open("cat.jpg")  # Load your image
inputs = processor(images=image, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = vit_model(**inputs)

logits = outputs.logits
predicted_class = logits.argmax(-1).item()
```
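To map the predicted index back to a human-readable label, the model config includes an id2label dictionary:

```python
# Look up the class name for the predicted index
label = vit_model.config.id2label[predicted_class]
print(label)
```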
## DETR Models

Object detection involves both classification and localization. DETR (DEtection TRansformer) approaches this as a set prediction problem, using transformers to directly output a fixed set of predictions.

```python
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import torch

# Load DETR model and processor
model_name = "facebook/detr-resnet-50"
processor = DetrImageProcessor.from_pretrained(model_name)
model = DetrForObjectDetection.from_pretrained(model_name)

# Process image and get predictions
image = Image.open("street_scene.jpg")  # Load your image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Get predicted boxes and classes
pred_boxes = outputs.pred_boxes[0]           # Bounding box coordinates
pred_scores = outputs.logits[0].softmax(-1)  # Class probabilities (last index is "no object")
pred_labels = pred_scores.argmax(-1)         # Predicted class labels
```
The pre-trained DETR model from Facebook AI can be easily accessed through Hugging Face's transformers module: DetrImageProcessor handles image processing and DetrForObjectDetection provides the model. DETR integrates a CNN backbone, a transformer encoder, a transformer decoder, and a feedforward prediction network, making it robust for object detection tasks.
```python
from transformers import DetrImageProcessor, DetrForObjectDetection

# Load pre-trained model and processor
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
```
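The raw outputs can then be converted into final, thresholded detections with the processor's post-processing helper. A short sketch, reusing the image and outputs from the snippet above:

```python
import torch

# Convert raw outputs to (score, label, box) detections above a confidence threshold
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score.item():.2f} at {box.tolist()}")
```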
Object detection models are evaluated with overlap-based metrics, most commonly Intersection over Union (IoU), the ratio of the intersection area to the union area of a predicted and a ground-truth box:
```python
# Example of IoU calculation
def calculate_iou(box1, box2):
    """Calculate IoU between two bounding boxes.

    Each box format: [x1, y1, x2, y2] (top-left and bottom-right corners)
    """
    # Calculate intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Calculate intersection area
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Calculate union area
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = box1_area + box2_area - intersection

    # Calculate IoU
    iou = intersection / union if union > 0 else 0
    return iou
```
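For example, two boxes offset by one unit share a 1x1 intersection over a combined area of 7:

```python
box_a = [0, 0, 2, 2]
box_b = [1, 1, 3, 3]
print(calculate_iou(box_a, box_b))  # 1 / (4 + 4 - 1) ≈ 0.143
```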
Transfer learning leverages pre-trained models to improve performance on new tasks. This approach requires less data and training time than training from scratch:
```python
# Fine-tuning a pre-trained ViT for a new classification task
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch.nn as nn
import torch.optim as optim

# Load pre-trained model
model_name = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_name)
vit_model = AutoModelForImageClassification.from_pretrained(model_name)

# 1. Replace classification head for new task (10 classes)
vit_model.classifier = nn.Linear(vit_model.classifier.in_features, 10)

# 2. Freeze feature extraction layers
for param in vit_model.vit.parameters():
    param.requires_grad = False

# 3. Unfreeze specific layers to fine-tune
# Unfreeze last encoder layer
for param in vit_model.vit.encoder.layer[11].parameters():
    param.requires_grad = True

# 4. Set up optimizer with different learning rates
optimizer = optim.AdamW([
    {'params': vit_model.classifier.parameters(), 'lr': 0.0003},
    {'params': vit_model.vit.encoder.layer[11].parameters(), 'lr': 0.0001}
], weight_decay=0.001)

# 5. Train with fine-tuning
criterion = nn.CrossEntropyLoss()
# (Training loop would follow)
```
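A minimal training loop sketch to round out step 5, assuming a hypothetical train_loader that yields preprocessed pixel tensors and integer labels:

```python
# Illustrative fine-tuning loop; `train_loader` is assumed, not defined above
vit_model.train()
for epoch in range(3):
    for pixel_values, labels in train_loader:
        optimizer.zero_grad()
        outputs = vit_model(pixel_values=pixel_values)  # HF models return an object with .logits
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
```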