Scikit-Learn Tutorial: Python Machine Learning Model Building
What is scikit-learn?
Scikit-learn (often shortened to “sklearn”) is a free, open-source machine learning library for Python that provides simple and efficient tools for data analysis and modeling. Built on NumPy, SciPy, and matplotlib, scikit-learn has become the go-to library for many data scientists and machine learning practitioners. Whether you’re a beginner just starting your machine learning journey or an experienced practitioner looking for reliable implementations, scikit-learn offers a consistent interface that makes experimenting with different algorithms straightforward and accessible.
The library was initially developed by David Cournapeau as part of a Google Summer of Code project in 2007. Since then, it has grown into a robust ecosystem maintained by a diverse community of contributors worldwide. The name “scikit-learn” comes from the fact that it’s a “SciKit” (SciPy Toolkit), an add-on package for SciPy, focusing specifically on machine learning algorithms.
The functionality that scikit-learn provides includes:
- Regression, including Linear and Logistic Regression
- Classification, including K-Nearest Neighbors
- Clustering, including K-Means (with k-means++ initialization)
- Model selection
- Preprocessing, including Min-Max Normalization
How to install scikit-learn
Installing scikit-learn is straightforward with Python’s package manager, pip. Before installing, make sure you have NumPy and SciPy installed, as scikit-learn depends on these libraries.
pip install -U scikit-learn
Alternatively, if you’re using Anaconda, you can install scikit-learn using conda:
conda install scikit-learn
To verify your installation, you can import the library in Python:
import sklearn
print(sklearn.__version__)
If the installation was successful, this code will print the version of scikit-learn installed on your system.
Step-by-step: building your first scikit-learn model
Let’s create a machine learning model using scikit-learn. We’ll walk through the complete workflow for building a model:
Step 1: Load a dataset
First, we’ll load the Iris dataset, one of scikit-learn’s built-in datasets:
# Import the necessary library
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()

# Store the feature matrix (X) and response vector (y)
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Target: species of iris

# Print feature and target names to understand the data
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)

# Examine the first few rows of the data
print("\nFirst 5 rows of X:\n", X[:5])
Output:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 5 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Step 2: Split the dataset
Next, we’ll divide our data into training and testing sets:
from sklearn.model_selection import train_test_split

# Split data into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Check the shapes to confirm the split
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
Output:
X_train shape: (105, 4)
X_test shape: (45, 4)
y_train shape: (105,)
y_test shape: (45,)
Step 3: Train the model
Now we’ll train a K-Nearest Neighbors classifier using our training data:
from sklearn.neighbors import KNeighborsClassifier

# Create a K-Nearest Neighbors classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model using the training sets
knn.fit(X_train, y_train)
Step 4: Make predictions
With our trained model, we can now make predictions on the test data:
# Predict the response for the test dataset
y_pred = knn.predict(X_test)

# Display the first few predictions
print("First 5 predictions:", y_pred[:5])
print("First 5 actual values:", y_test[:5])
Step 5: Evaluate the model
Finally, we’ll evaluate how well our model performed:
from sklearn import metrics

# Check accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Generate a classification report
print("\nClassification Report:")
print(metrics.classification_report(y_test, y_pred, target_names=iris.target_names))

# Create a confusion matrix
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(y_test, y_pred))
Step 6: Make new predictions
To wrap up, let’s use our model to predict the species of some new iris flowers:
# Sample data for prediction
# Values represent: sepal length, sepal width, petal length, petal width
new_samples = [[5.1, 3.5, 1.4, 0.2],  # Similar to setosa
               [6.3, 3.3, 6.0, 2.5],  # Similar to virginica
               [5.9, 3.0, 4.2, 1.5]]  # Similar to versicolor

# Make predictions
new_predictions = knn.predict(new_samples)

# Display results
for i, pred in enumerate(new_predictions):
    print(f"Sample {i+1}: Predicted as {iris.target_names[pred]}")
This example demonstrates the typical machine learning workflow with scikit-learn, from loading data to making new predictions with a trained model.
Key features of Scikit-learn
Scikit-learn stands out among machine learning libraries due to several key features that make it user-friendly and powerful:
Consistent API
Scikit-learn provides a uniform interface where most estimators follow the same pattern:
- Initialize: model = Algorithm(params)
- Train: model.fit(X_train, y_train)
- Predict: y_pred = model.predict(X_test)
- Evaluate: score = model.score(X_test, y_test)
This consistency makes experimenting with different algorithms quick and intuitive.
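For example, here is a minimal sketch (reusing the iris train/test split from earlier) that runs two different classifiers through the exact same four calls; only the estimator class changes:

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# The same four-step pattern works for any estimator
for Algorithm in (DecisionTreeClassifier, SVC):
    model = Algorithm()                  # Initialize
    model.fit(X_train, y_train)          # Train
    y_pred = model.predict(X_test)       # Predict
    score = model.score(X_test, y_test)  # Evaluate
    print(f"{Algorithm.__name__}: {score:.2f}")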
Wide range of algorithms
Scikit-learn provides implementations of many popular machine learning algorithms, including:
- Supervised learning: Linear/logistic regression, decision trees, random forests, SVMs
- Unsupervised learning: K-means, hierarchical clustering, PCA, t-SNE
- Model selection: Cross-validation, grid search, hyperparameter tuning (a quick cross-validation sketch follows this list)
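To give a taste of the model selection tools, here is a minimal cross-validation sketch, assuming the iris X and y loaded in the earlier walkthrough:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation: train and score on 5 different splits
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.2f}")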
Preprocessing capabilities
Data preprocessing is a crucial step in any machine learning pipeline. Scikit-learn offers various tools for:
- Feature scaling (StandardScaler, MinMaxScaler)
- Encoding (OneHotEncoder, LabelEncoder)
- Feature selection and extraction
- Missing value handling (a short sketch combining several of these tools follows below)
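As a rough sketch of how a few of these tools work together (the toy arrays below are invented purely for illustration):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy numeric column with a missing value
ages = np.array([[25.0], [32.0], [np.nan], [47.0]])
ages = SimpleImputer(strategy='mean').fit_transform(ages)  # Fill NaN with the column mean
ages = StandardScaler().fit_transform(ages)                # Scale to zero mean, unit variance

# Toy categorical column
colors = np.array([['red'], ['blue'], ['red'], ['green']])
colors_encoded = OneHotEncoder().fit_transform(colors).toarray()  # One binary column per category

print(ages.ravel())
print(colors_encoded)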
Pipeline integration
The Pipeline class allows you to chain multiple preprocessing steps and a final estimator into a single object, making your workflow more organized and less prone to errors like data leakage:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Now you can use this pipeline like any other estimator
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
Model persistence
Scikit-learn models can easily be saved for later use with the joblib library. (Note that the old from sklearn.externals import joblib path was removed in scikit-learn 0.23; import joblib directly instead.)

import joblib

# Save the model
joblib.dump(model, 'model.pkl')

# Load the model
loaded_model = joblib.load('model.pkl')
Where is Scikit-learn used?
Scikit-learn is widely used across various domains and industries:
Data science and research
Researchers use scikit-learn to prototype models quickly and analyze experimental data. Its accessibility and extensive documentation make it ideal for academic research and publication.
Business and industry
Companies use scikit-learn for:
- Customer segmentation and behavior analysis
- Demand forecasting and inventory management
- Fraud detection and risk assessment
- Recommendation systems
Education
Scikit-learn’s simplicity makes it an excellent tool for teaching machine learning concepts, which is why many educational platforms, including Codecademy, use it in their courses.
Prototyping
Even teams that ultimately deploy models using other frameworks often prototype using scikit-learn due to its quick setup and ease of use.
Use cases of Scikit-Learn
Let’s explore some common use cases for scikit-learn with practical examples:
Classification
Classification involves predicting a categorical label. Here’s a simple example using the famous Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
Regression
Regression is used to predict continuous values. Here’s an example using the California housing dataset (the Boston housing dataset was removed from scikit-learn in version 1.2):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
Clustering
Clustering is an unsupervised learning technique for grouping similar data points:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-means clustering
kmeans = KMeans(n_clusters=4)
y_kmeans = kmeans.fit_predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.title('K-means Clustering Results')
plt.show()
Dimensionality reduction
PCA (Principal Component Analysis) is commonly used for reducing the number of features while preserving variance:
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Load data
digits = load_digits()
X = digits.data

# Apply PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Plot results
plt.figure(figsize=(10, 8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=digits.target, cmap='viridis', alpha=0.5)
plt.colorbar()
plt.title('PCA of Digits Dataset')
plt.show()
Model selection
Scikit-learn provides tools for finding the best hyperparameters:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Set up grid search
grid = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring='accuracy',
    verbose=1
)

# Fit grid search
grid.fit(X_train, y_train)

# Print best parameters
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_:.2f}")
Scikit-Learn vs TensorFlow vs PyTorch
When choosing a machine learning library, it’s important to understand how scikit-learn compares to alternatives:
| Feature | Scikit-learn | TensorFlow | PyTorch |
| --- | --- | --- | --- |
| Primary focus | Classical ML algorithms | Deep learning | Deep learning |
| Learning curve | Gentle | Steep | Moderate |
| Performance with large data | Limited | Excellent | Excellent |
| GPU acceleration | Limited | Extensive | Extensive |
| Neural network support | Basic | Advanced | Advanced |
| Deployment | Simple | Production-ready | Research-friendly |
| Community size | Large | Very large | Large and growing |
| Ideal use cases | Classical ML, prototyping, tabular data | Production deep learning, deployment | Research, experimentation, flexibility |
When to choose Scikit-Learn
- You’re learning machine learning fundamentals
- You need quick prototyping with classical algorithms
- Your dataset fits in memory
- You want a consistent, simple API
- You’re working with structured, tabular data
When to choose TensorFlow or PyTorch
- TensorFlow: For production-ready deep learning, mobile/edge deployment, or when using the TensorFlow Extended (TFX) ecosystem
- PyTorch: For research-oriented projects, rapid experimentation, or when dynamic computational graphs are needed
Conclusion
Scikit-learn is an invaluable tool in any data scientist’s toolkit, offering a perfect balance of simplicity and power. Its consistent API, comprehensive documentation, and wide range of algorithms make it an excellent choice for beginners and experienced practitioners alike.
Whether you’re classifying emails, predicting stock prices, segmenting customers, or reducing dimensionality for visualization, scikit-learn provides the tools you need to build effective machine learning models in Python.
Ready to deepen your scikit-learn skills? Explore Codecademy’s Machine Learning with Python course, which covers scikit-learn in depth, from basic concepts to advanced techniques.
Frequently asked questions
1. Is scikit-learn better than TensorFlow?
Neither is inherently “better” – they serve different purposes. Scikit-learn excels at traditional machine learning algorithms with a simple, consistent API, making it ideal for beginners and for quickly prototyping models. TensorFlow specializes in deep learning and neural networks, offering more flexibility and computational power for complex models, especially those requiring GPU acceleration. Choose scikit-learn for classical machine learning tasks and TensorFlow for deep learning projects.
2. What is the difference between sklearn and scikit-learn?
There is no difference – “sklearn” is simply the module name used in Python import statements for the scikit-learn library. When importing the library, you write import sklearn, but the full name of the project is “scikit-learn.” This naming convention follows Python’s import system requirements, which don’t allow hyphens in module names.
3. What are the advantages of sklearn?
Scikit-learn offers several advantages:
- Consistency: All algorithms follow the same API pattern
- Comprehensive documentation: Extensive examples and tutorials
- Integration: Works seamlessly with NumPy, Pandas, and matplotlib
- Preprocessing tools: Robust toolset for data preparation
- Model selection: Built-in cross-validation and hyperparameter tuning
- Active community: Regular updates and responsive support
- Low dependencies: Minimal external requirements beyond NumPy and SciPy
4. Is Keras better than sklearn?
Keras and scikit-learn serve different purposes and excel in different areas. Keras is a high-level neural networks API that runs on top of TensorFlow, specializing in deep learning models like convolutional neural networks and recurrent neural networks. Scikit-learn focuses on traditional machine learning algorithms like decision trees, SVMs, and linear models.