PyTorch Image Models
PyTorch provides comprehensive tools for vision tasks through libraries like torchvision:

    from torchvision import datasets, transforms, models

    # Load a pre-built model for classification
    # (the weights= argument replaces the deprecated pretrained=True in torchvision >= 0.13)
    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # Load a dataset
    cifar10 = datasets.CIFAR10(
        root='./data',
        train=True,
        download=True,
        transform=transforms.ToTensor(),
    )

DETR Models
Object detection involves both classification (identifying what each object is) and localization (determining where it is, usually as a bounding box). DETR (DEtection TRansformer) approaches this as a set prediction problem, using transformers to directly output a fixed set of predictions.

    from PIL import Image
    from transformers import DetrImageProcessor, DetrForObjectDetection
    import torch

    # Load DETR model and processor
    model_name = "facebook/detr-resnet-50"
    processor = DetrImageProcessor.from_pretrained(model_name)
    model = DetrForObjectDetection.from_pretrained(model_name)

    # Process image and get predictions
    image = Image.open("street_scene.jpg").convert("RGB")  # Load your image
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted boxes and classes (one row per object query)
    pred_boxes = outputs.pred_boxes[0]  # Normalized (cx, cy, w, h) box coordinates
    pred_scores = outputs.logits[0].softmax(-1)  # Class probabilities; the last index means "no object"
    pred_labels = pred_scores.argmax(-1)  # Predicted class labels

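The boxes in `outputs.pred_boxes` are normalized `(center_x, center_y, width, height)` values between 0 and 1, so they usually need converting to pixel corners before drawing them or computing overlap. A minimal sketch of that conversion in plain Python follows; the helper name `cxcywh_to_xyxy` and the 640x480 image size are illustrative assumptions, not part of the transformers API.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height
    return [x1, y1, x2, y2]


# Hypothetical prediction: a box centered in a 640x480 image,
# covering half the width and half the height.
pixel_box = cxcywh_to_xyxy([0.5, 0.5, 0.5, 0.5], image_width=640, image_height=480)
print(pixel_box)  # [160.0, 120.0, 480.0, 360.0]
```

In practice, `DetrImageProcessor.post_process_object_detection` can perform this conversion together with score thresholding.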
The pre-trained DETR model from Facebook AI can be accessed through Hugging Face’s transformers library. Use DetrImageProcessor to preprocess images and DetrForObjectDetection to load the model. DETR combines a CNN backbone, a transformer encoder and decoder, and feed-forward prediction heads that output a class and a bounding box for each object query.

    from transformers import DetrImageProcessor, DetrForObjectDetection

    # Load pre-trained model and processor
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

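Object detection models are evaluated using metrics such as Intersection over Union (IoU): the area where a predicted box and a ground-truth box overlap, divided by the area of their union.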

    # Example of IoU calculation
    def calculate_iou(box1, box2):
        """
        Calculate IoU between two bounding boxes
        Each box format: [x1, y1, x2, y2] (top-left and bottom-right corners)
        """
        # Calculate intersection coordinates
        x1 = max(box1[0], box2[0])
        y1 = max(box1[1], box2[1])
        x2 = min(box1[2], box2[2])
        y2 = min(box1[3], box2[3])

        # Calculate intersection area
        intersection = max(0, x2 - x1) * max(0, y2 - y1)

        # Calculate union area
        box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
        box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
        union = box1_area + box2_area - intersection

        # Calculate IoU
        iou = intersection / union if union > 0 else 0
        return iou

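IoU is typically used to decide whether a prediction counts as a correct detection: a predicted box that overlaps a ground-truth box with IoU at or above a threshold (0.5 is a common choice) is a true positive. The sketch below shows one simple way to turn that rule into precision and recall. It assumes predictions are already sorted by descending confidence and uses greedy one-to-one matching; the function names and boxes are hypothetical, and benchmark implementations are more involved.

```python
def iou(box1, box2):
    """IoU of two [x1, y1, x2, y2] boxes (same logic as calculate_iou above)."""
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(predictions, ground_truths, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes.

    Assumes `predictions` is already sorted by descending confidence.
    """
    unmatched = list(ground_truths)
    true_positives = 0
    for pred in predictions:
        # Find the best remaining ground-truth box for this prediction
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each ground truth can be matched only once
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return precision, recall


# Hypothetical boxes: the first prediction overlaps a ground truth, the second does not
preds = [[0, 0, 10, 10], [50, 50, 60, 60]]
truths = [[1, 1, 10, 10], [100, 100, 120, 120]]
print(precision_recall(preds, truths))  # (0.5, 0.5)
```

Summary metrics such as mean Average Precision build on these true-positive counts, computed across confidence levels and classes.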
ViTs
Vision Transformers, or ViTs, apply the attention mechanism of transformers, originally developed for text, to images. ViTs are used for tasks such as image classification, object detection, and image segmentation. Vision Transformers adapt the transformer architecture for images by splitting each image into fixed-size patches, embedding every patch as a token, adding position embeddings, and processing the resulting sequence with a transformer encoder.
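The first step of a ViT, cutting an image into fixed-size patches that become the token sequence, can be sketched in plain Python. The function name `split_into_patches` and the toy 4x4 single-channel image are illustrative assumptions; real implementations work on tensors and follow the split with a learned linear projection.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches,
    returned in row-major order."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 single-channel image split into 2x2 patches
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
patches = split_into_patches(image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]

# The same arithmetic for a 224x224 image with 16x16 patches
num_patches = (224 // 16) ** 2
print(num_patches)   # 196
```

For classification, a ViT typically prepends one extra class token to this sequence, so a 224x224 image with 16x16 patches yields 197 tokens.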
Google ViT Base
Google’s pre-trained ViT models can be loaded through Hugging Face. The name "google/vit-base-patch16-224" indicates the base-sized ViT, 16x16-pixel patches, and a 224x224 input resolution. AutoImageProcessor handles image preprocessing, and AutoModelForImageClassification loads the model weights.

    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification
    import torch

    # Load pre-trained ViT processor and model
    model_name = "google/vit-base-patch16-224"
    processor = AutoImageProcessor.from_pretrained(model_name)
    vit_model = AutoModelForImageClassification.from_pretrained(model_name)

    # Process an image for the model
    image = Image.open("cat.jpg").convert("RGB")  # Load your image
    inputs = processor(images=image, return_tensors="pt")

    # Get predictions
    with torch.no_grad():
        outputs = vit_model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

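The `logits` returned by the model are unnormalized scores. Applying softmax turns them into probabilities, which is useful when reporting the few most likely classes instead of only the argmax. The sketch below does this in plain Python; `top_k_classes`, the logits, and the labels are hypothetical stand-ins. With a real model, `vit_model.config.id2label` maps class indices to names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_classes(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels for a 4-class model
example_logits = [2.0, 0.5, 4.0, 1.0]
example_labels = {0: "dog", 1: "car", 2: "cat", 3: "bird"}
for label, prob in top_k_classes(example_logits, example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large scores.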
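Transfer learning leverages pre-trained models to improve performance on new tasks: the classification head is replaced for the new task, the pre-trained feature-extraction layers are frozen, and selected layers can be unfrozen and fine-tuned with a lower learning rate. This approach requires less data and training time than training from scratch.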

    # Fine-tuning a pre-trained ViT for a new classification task
    from transformers import AutoImageProcessor, AutoModelForImageClassification
    import torch.nn as nn
    import torch.optim as optim

    # Load pre-trained model
    model_name = "google/vit-base-patch16-224"
    processor = AutoImageProcessor.from_pretrained(model_name)
    vit_model = AutoModelForImageClassification.from_pretrained(model_name)

    # 1. Replace classification head for new task (10 classes)
    vit_model.classifier = nn.Linear(vit_model.classifier.in_features, 10)

    # 2. Freeze feature extraction layers
    for param in vit_model.vit.parameters():
        param.requires_grad = False

    # 3. Unfreeze specific layers to fine-tune
    # Unfreeze last encoder layer
    for param in vit_model.vit.encoder.layer[11].parameters():
        param.requires_grad = True

    # 4. Set up optimizer with different learning rates
    optimizer = optim.AdamW([
        {'params': vit_model.classifier.parameters(), 'lr': 0.0003},
        {'params': vit_model.vit.encoder.layer[11].parameters(), 'lr': 0.0001}
    ], weight_decay=0.001)

    # 5. Train with fine-tuning
    criterion = nn.CrossEntropyLoss()
    # (Training loop would follow)
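
The training loop left out above could look roughly like the following sketch. To stay runnable without downloading weights or data, it uses a tiny stand-in model and random tensors; `TinyClassifier`, the tensor shapes, the batch size, and the number of epochs are all assumptions made for illustration. With the ViT model, the forward pass would instead be `vit_model(pixel_values=images).logits`.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pre-trained model: backbone + classification head."""

    def __init__(self, num_features=32, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 8 * 8, num_features), nn.ReLU()
        )
        self.classifier = nn.Linear(num_features, num_classes)

    def forward(self, x):
        return self.classifier(self.backbone(x))


model = TinyClassifier()
for param in model.backbone.parameters():
    param.requires_grad = False  # freeze the "pre-trained" part

# Random tensors standing in for a real image dataset
images = torch.randn(64, 3, 8, 8)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.classifier.parameters(), lr=3e-4, weight_decay=0.001)

frozen_before = model.backbone[1].weight.clone()
epoch_losses = []
for epoch in range(3):
    model.train()
    running_loss = 0.0
    for batch_images, batch_labels in loader:
        optimizer.zero_grad()                    # clear gradients from the previous step
        logits = model(batch_images)             # forward pass
        loss = criterion(logits, batch_labels)
        loss.backward()                          # gradients flow only to unfrozen parameters
        optimizer.step()
        running_loss += loss.item() * batch_images.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))
    print(f"epoch {epoch + 1}: loss = {epoch_losses[-1]:.4f}")
```

Because the backbone parameters have `requires_grad = False` and are not passed to the optimizer, only the classification head changes during training.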