Semi-Supervised Learning
Semi-Supervised Learning is an approach in machine learning that combines elements of both supervised learning and unsupervised learning. It is particularly helpful when a dataset contains a small amount of labeled data and a large amount of unlabeled data.
By leveraging patterns in the unlabeled data, semi-supervised learning can improve model accuracy and generalization while reducing the reliance on extensive labeled datasets.
Syntax
A common approach to semi-supervised learning, known as self-training (or pseudo-labeling), follows these steps:
1. Train an initial model using the available labeled data.
2. Use the trained model to predict labels for the unlabeled data.
3. Select high-confidence pseudo-labels and add them to the labeled dataset.
4. Retrain the model with the expanded labeled dataset.
5. Repeat steps 2-4 iteratively until the model converges or a stopping criterion is met.
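The steps above can be sketched as a manual loop. This is a minimal illustration, not a reference implementation: the synthetic dataset, the `LogisticRegression` base model, and the names `CONFIDENCE_THRESHOLD` and `MAX_ROUNDS` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 500 samples, only the first 50 keep their labels
X, y_true = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_labeled, y_labeled = X[:50], y_true[:50]
X_unlabeled = X[50:]

CONFIDENCE_THRESHOLD = 0.9  # hypothetical cutoff for accepting a pseudo-label
MAX_ROUNDS = 10             # hypothetical stopping criterion

# Step 1: train an initial model on the labeled data
model = LogisticRegression(max_iter=1000)
model.fit(X_labeled, y_labeled)

for _ in range(MAX_ROUNDS):
    if len(X_unlabeled) == 0:
        break  # every sample has been labeled

    # Step 2: predict class probabilities for the unlabeled pool
    probabilities = model.predict_proba(X_unlabeled)
    confidence = probabilities.max(axis=1)
    pseudo_labels = model.classes_[probabilities.argmax(axis=1)]

    # Step 3: keep only the high-confidence pseudo-labels
    confident = confidence >= CONFIDENCE_THRESHOLD
    if not confident.any():
        break  # nothing new to add, so stop early

    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    y_labeled = np.concatenate([y_labeled, pseudo_labels[confident]])
    X_unlabeled = X_unlabeled[~confident]

    # Step 4: retrain on the expanded labeled set
    model.fit(X_labeled, y_labeled)

print(f"Labeled samples after self-training: {len(y_labeled)}")
print(f"Samples still unlabeled: {len(X_unlabeled)}")
```

A higher threshold adds fewer but more reliable pseudo-labels per round, while a lower one grows the labeled set faster at the risk of reinforcing early mistakes.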
Example
A common example of semi-supervised learning is using a self-training classifier with Scikit-learn in Python:
```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = datasets.load_digits(return_X_y=True)

# Create a mask to simulate unlabeled data (-1 represents unlabeled samples)
unlabeled_mask = np.random.rand(len(y)) < 0.8
y_unlabeled = np.copy(y)
y_unlabeled[unlabeled_mask] = -1  # Set about 80% of labels to -1 (unlabeled)

# Split into training and test datasets
X_train, X_test, y_train, y_test_masked = train_test_split(
    X, y_unlabeled, test_size=0.2, random_state=42
)

# Get the true labels for the test set (same random_state, so the split matches)
_, y_test_true = train_test_split(y, test_size=0.2, random_state=42)

# Define the base classifier
base_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Create the semi-supervised model
self_training_model = SelfTrainingClassifier(base_classifier)

# Train the model
self_training_model.fit(X_train, y_train)

# Get predictions
y_pred = self_training_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test_true, y_pred)
print(f"Semi-Supervised Model Accuracy: {accuracy:.2f}")
```
This example uses a self-training classifier: a base model is trained on the labeled samples and then iteratively assigns pseudo-labels to the unlabeled samples it is confident about, retraining on the expanded set each time. The output will look similar to:
Semi-Supervised Model Accuracy: 0.89
Note: The unlabeled mask is generated with `np.random.rand()` without a seed, so the accuracy may change slightly each time the code runs. The call to `train_test_split()` and the random forest are already fixed with `random_state=42`, so setting a seed with `np.random.seed()` before creating the mask is enough to make the result reproducible.
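As a rough sketch of the note above, the variant below seeds NumPy before building the mask and then inspects what self-training did. The variable names and the choice to fit on the full dataset (no train/test split) are assumptions made to keep the example short. `threshold`, `labeled_iter_`, and `termination_condition_` are part of scikit-learn's `SelfTrainingClassifier`.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = datasets.load_digits(return_X_y=True)

# Seed NumPy before creating the mask so the same samples are hidden on every run
np.random.seed(42)
unlabeled_mask = np.random.rand(len(y)) < 0.8
y_partial = np.copy(y)
y_partial[unlabeled_mask] = -1

# threshold=0.75 is scikit-learn's default, written out here for visibility
model = SelfTrainingClassifier(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=0.75,
)
model.fit(X, y_partial)

# labeled_iter_: 0 = labeled from the start, -1 = never labeled,
# k > 0 = pseudo-labeled during iteration k
originally_labeled = int(np.sum(model.labeled_iter_ == 0))
pseudo_labeled = int(np.sum(model.labeled_iter_ > 0))
never_labeled = int(np.sum(model.labeled_iter_ == -1))

print(f"Originally labeled: {originally_labeled}")
print(f"Pseudo-labeled during self-training: {pseudo_labeled}")
print(f"Never labeled: {never_labeled}")
print(f"Stopped because: {model.termination_condition_}")
```

Samples that never reach the confidence threshold stay unlabeled and do not influence the final model, which is why the three counts together account for the whole dataset.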