Scikit-Learn Tutorial: Python Machine Learning Model Building

Learn how to build powerful machine learning models with scikit-learn in Python. Master essential techniques from installation to implementation with practical examples and comparisons.

What is scikit-learn?

Scikit-learn (often shortened to “sklearn”) is a free, open-source machine learning library for Python that provides simple and efficient tools for data analysis and modeling. Built on NumPy, SciPy, and matplotlib, scikit-learn has become the go-to library for many data scientists and machine learning practitioners. Whether you’re a beginner just starting your machine learning journey or an experienced practitioner looking for reliable implementations, scikit-learn offers a consistent interface that makes experimenting with different algorithms straightforward and accessible.

The library was initially developed by David Cournapeau as part of a Google Summer of Code project in 2007. Since then, it has grown into a robust ecosystem maintained by a diverse community of contributors worldwide. The name “scikit-learn” comes from the fact that it’s a “SciKit” (SciPy Toolkit), an add-on package for SciPy, focusing specifically on machine learning algorithms.

The functionality that scikit-learn provides includes:

  • Regression, including Linear Regression
  • Classification, including Logistic Regression and K-Nearest Neighbors
  • Clustering, including K-Means (with K-Means++ initialization)
  • Model selection
  • Preprocessing, including Min-Max Normalization

How to install scikit-learn

Installing scikit-learn is straightforward with Python’s package manager, pip. Scikit-learn depends on NumPy and SciPy, and pip installs them automatically if they are missing.

pip install -U scikit-learn

Alternatively, if you’re using Anaconda, you can install scikit-learn using conda:

conda install scikit-learn

To verify your installation, you can import the library in Python:

import sklearn
print(sklearn.__version__)

If the installation was successful, this code will print the version of scikit-learn installed on your system.

Step-by-step: building your first scikit-learn model

Let’s create a machine learning model using scikit-learn. We’ll walk through the complete workflow for building a model:

Step 1: Load a dataset

First, we’ll load the Iris dataset, one of scikit-learn’s built-in datasets:

# Import the necessary library
from sklearn.datasets import load_iris
# Load the dataset
iris = load_iris()
# Store the feature matrix (X) and response vector (y)
X = iris.data # Features: sepal length, sepal width, petal length, petal width
y = iris.target # Target: species of iris
# Print feature and target names to understand the data
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
# Examine the first few rows of the data
print("\nFirst 5 rows of X:\n", X[:5])

Output:

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 5 rows of X:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]

Step 2: Split the dataset

Next, we’ll divide our data into training and testing sets:

from sklearn.model_selection import train_test_split
# Split data into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Check the shapes to confirm the split
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

Output:

X_train shape: (105, 4)
X_test shape: (45, 4)
y_train shape: (105,)
y_test shape: (45,)
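
For classification data it is often useful to keep the class proportions the same in both splits. The following is a minimal sketch, assuming the `stratify` parameter of `train_test_split` suits your data; the `_s` suffixed variable names are illustrative and not part of the tutorial's code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the share of each species the same in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Without `stratify`, a random split can leave one species over- or under-represented in the test set, which makes the accuracy estimate noisier on small datasets.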

Step 3: Train the model

Now we’ll train a K-Nearest Neighbors classifier using our training data:

from sklearn.neighbors import KNeighborsClassifier
# Create a K-Nearest Neighbors classifier
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model using the training sets
knn.fit(X_train, y_train)
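
The value `n_neighbors=3` is a choice, not a rule. One way to compare alternatives is cross-validation on the training data only. This is a sketch under stated assumptions: the list of candidate values is hypothetical, and 5-fold cross-validation is just one reasonable setting:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical candidate values for k (illustrative only)
candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation on the training set; the test set stays untouched
    scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores[k] = scores.mean()
    print(f"k={k}: mean CV accuracy = {mean_scores[k]:.3f}")

# Pick the candidate with the highest mean cross-validation accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("Best k among candidates:", best_k)
```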

Step 4: Make predictions

With our trained model, we can now make predictions on the test data:

# Predict the response for test dataset
y_pred = knn.predict(X_test)
# Display the first few predictions
print("First 5 predictions:", y_pred[:5])
print("First 5 actual values:", y_test[:5])

Step 5: Evaluate the model

Finally, we’ll evaluate how well our model performed:

from sklearn import metrics
# Check accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Generate a classification report
print("\nClassification Report:")
print(metrics.classification_report(y_test, y_pred, target_names=iris.target_names))
# Create a confusion matrix
print("\nConfusion Matrix:")
print(metrics.confusion_matrix(y_test, y_pred))
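
To see how the accuracy and the confusion matrix relate, the sketch below recomputes accuracy from the matrix: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. It repeats the earlier steps so that it runs on its own; `accuracy_from_cm` is an illustrative name:

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, y_pred)

# Diagonal entries are correct predictions
accuracy_from_cm = np.trace(cm) / cm.sum()
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy from confusion matrix: {accuracy_from_cm:.4f}")
print(f"Accuracy from accuracy_score:   {accuracy:.4f}")
```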

Step 6: Make new predictions

Now let’s use our model to predict the species of some new iris flowers:

# Sample data for prediction
# Values represent: sepal length, sepal width, petal length, petal width
new_samples = [[5.1, 3.5, 1.4, 0.2],  # Similar to setosa
               [6.3, 3.3, 6.0, 2.5],  # Similar to virginica
               [5.9, 3.0, 4.2, 1.5]]  # Similar to versicolor
# Make predictions
new_predictions = knn.predict(new_samples)
# Display results
for i, pred in enumerate(new_predictions):
    print(f"Sample {i+1}: Predicted as {iris.target_names[pred]}")
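
Besides a single predicted label, many classifiers can report how confident they are. The sketch below uses `predict_proba`, which for K-Nearest Neighbors with the default uniform weights is the share of the nearest neighbors belonging to each class. It refits the model so that it runs on its own:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

new_samples = [[5.1, 3.5, 1.4, 0.2],
               [6.3, 3.3, 6.0, 2.5],
               [5.9, 3.0, 4.2, 1.5]]

# One row per sample, one column per class; each row sums to 1
probabilities = knn.predict_proba(new_samples)
for sample_probs in probabilities:
    best = int(np.argmax(sample_probs))
    print(f"{iris.target_names[best]}: {sample_probs}")
```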

This example demonstrates the typical machine learning workflow with scikit-learn, from loading data to making new predictions with a trained model.

Key features of Scikit-learn

Scikit-learn stands out among machine learning libraries due to several key features that make it user-friendly and powerful:

Consistent API

Scikit-learn provides a uniform interface where most estimators follow the same pattern:

  • Initialize: model = Algorithm(params)

  • Train: model.fit(X_train, y_train)

  • Predict: y_pred = model.predict(X_test)

  • Evaluate: score = model.score(X_test, y_test)

This consistency makes experimenting with different algorithms quick and intuitive.
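
The pattern above can be sketched as a loop that swaps estimators without changing any other code. The three classifiers and the dictionary keys are illustrative choices, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Three different algorithms, all exposing the same fit/score interface
models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # Train
    scores[name] = model.score(X_test, y_test)   # Evaluate (accuracy for classifiers)
    print(f"{name}: {scores[name]:.3f}")
```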

Wide range of algorithms

Scikit-learn provides implementations of many popular machine learning algorithms, including:

  • Supervised learning: Linear/logistic regression, decision trees, random forests, SVMs

  • Unsupervised learning: K-means, hierarchical clustering, PCA, t-SNE

  • Model selection: Cross-validation, grid search, hyperparameter tuning

Preprocessing capabilities

Data preprocessing is a crucial step in any machine learning pipeline. Scikit-learn offers various tools for:

  • Feature scaling (StandardScaler, MinMaxScaler)

  • Encoding (OneHotEncoder, LabelEncoder)

  • Feature selection and extraction

  • Missing value handling
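
A few of these tools can be sketched on toy data. The arrays below are made up for illustration; `SimpleImputer` with `strategy="mean"` is one of several ways to handle missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy numeric column with one missing value (illustrative data)
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

# Missing value handling: replace NaN with the mean of the observed values
ages_filled = SimpleImputer(strategy="mean").fit_transform(ages)

# Feature scaling: zero mean and unit variance, or rescaling to [0, 1]
ages_standard = StandardScaler().fit_transform(ages_filled)
ages_minmax = MinMaxScaler().fit_transform(ages_filled)

# Encoding: one binary column per category, in sorted category order
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
encoder = OneHotEncoder()
colors_encoded = encoder.fit_transform(colors).toarray()

print("Filled:", ages_filled.ravel())
print("Min-max scaled:", ages_minmax.ravel())
print("Categories:", encoder.categories_[0])
print("One-hot encoded:\n", colors_encoded)
```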

Pipeline integration

The Pipeline class allows you to chain multiple preprocessing steps and a final estimator into a single object, making your workflow more organized and less prone to errors like data leakage:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
# Now you can use this pipeline like any other estimator
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
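
The leakage point becomes concrete with cross-validation: when a pipeline is passed to `cross_val_score`, the scaler is refit on the training portion of every fold, so statistics from the held-out fold never reach the model. A minimal sketch, assuming 5 folds and `max_iter=1000` (both illustrative settings):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Scaler and classifier are both refit inside each fold
cv_scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```

Scaling the full dataset before splitting would instead let the test folds influence the scaler's mean and variance.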

Model persistence

Trained models can be saved for later use with joblib, which is installed as a dependency of scikit-learn:

import joblib
# Save a fitted model (here, the pipeline trained above)
joblib.dump(pipe, 'model.pkl')
# Load the model
loaded_model = joblib.load('model.pkl')
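
A round-trip check can confirm that a saved model behaves the same after loading. This is a minimal sketch, assuming joblib is importable directly (it is a dependency of scikit-learn); the file name and the temporary directory are illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), loaded_model.predict(X))
print("Predictions match after reload:", same_predictions)
```

Files saved this way are pickle-based, so only load model files from sources you trust, and load them with the same scikit-learn version that saved them where possible.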

Where is Scikit-learn used?

Scikit-learn is widely used across various domains and industries:

Data science and research

Researchers use scikit-learn to prototype models quickly and analyze experimental data. Its accessibility and extensive documentation make it ideal for academic research and publication.

Business and industry

Companies use scikit-learn for:

  • Customer segmentation and behavior analysis

  • Demand forecasting and inventory management

  • Fraud detection and risk assessment

  • Recommendation systems

Education

Scikit-learn’s simplicity makes it an excellent tool for teaching machine learning concepts, which is why many educational platforms, including Codecademy, use it in their courses.

Prototyping

Even teams that ultimately deploy models using other frameworks often prototype using scikit-learn due to its quick setup and ease of use.

Use cases of Scikit-Learn

Let’s explore some common use cases for scikit-learn with practical examples:

Classification

Classification involves predicting a categorical label. Here’s a simple example using the famous Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
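
A fitted random forest also reports how much each feature contributed to its splits. The sketch below reads the `feature_importances_` attribute; fixing `random_state` is an assumption made here for reproducibility:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(iris.data, iris.target)

# Impurity-based importances: one value per feature, summing to 1
importances = clf.feature_importances_
ranked = sorted(
    zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```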

Regression

Regression is used to predict continuous values. Here’s an example using the built-in diabetes dataset (the Boston housing dataset used in older tutorials was removed in scikit-learn 1.2):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Load data
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
reg = LinearRegression()
reg.fit(X_train, y_train)
# Predict and evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
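
Mean squared error is expressed in squared target units, which can be hard to interpret. The sketch below adds two common companions, RMSE and R². It uses the built-in diabetes dataset as an assumption, chosen because it ships with the library and needs no download:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# RMSE is in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# R^2 is 1.0 for a perfect fit and 0.0 for a model that always predicts the mean
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R^2: {r2:.2f}")
```

For regressors, `reg.score(X_test, y_test)` returns the same R² value.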

Clustering

Clustering is an unsupervised learning technique for grouping similar data points:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Apply K-means clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)  # fixed seed for reproducible clusters
y_kmeans = kmeans.fit_predict(X)
# Plot results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.title('K-means Clustering Results')
plt.show()
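
In the example above the number of clusters is known in advance. When it is not, one common heuristic is to compare silhouette scores across several values of k. This is a sketch, not the only approach; the candidate range 2 to 6 and the `n_init=10` setting are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {silhouettes[k]:.3f}")

best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```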

Dimensionality reduction

PCA (Principal Component Analysis) is commonly used for reducing the number of features while preserving variance:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
# Load data
digits = load_digits()
X = digits.data
# Apply PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Plot results
plt.figure(figsize=(10, 8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=digits.target, cmap='viridis', alpha=0.5)
plt.colorbar()
plt.title('PCA of Digits Dataset')
plt.show()
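
How much variance two components actually preserve can be checked with `explained_variance_ratio_`. The sketch below also passes a float to `n_components`, which asks PCA to keep as many components as needed to reach that share of variance; the 0.95 threshold is an illustrative choice:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # 64 pixel features per image

# Share of the total variance captured by each of the first two components
pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_
print("Variance explained by 2 components:", ratios.sum())

# A float between 0 and 1 selects the number of components automatically
pca_95 = PCA(n_components=0.95).fit(X)
print("Components needed for 95% variance:", pca_95.n_components_)
```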

Model selection

Scikit-learn provides tools for finding the best hyperparameters:

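from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
# Load and split a classification dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define parameter grid (gamma has no effect on the linear kernel)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}
# Set up grid search
grid = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring='accuracy',
    verbose=1
)
# Fit grid search
grid.fit(X_train, y_train)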
# Print best parameters
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_:.2f}")
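
The best cross-validation score is computed on the training data only, so a final check on held-out data is still worthwhile. A minimal sketch, assuming a smaller illustrative grid and the Iris data; by default `GridSearchCV` refits the best configuration on the whole training set and exposes it as `best_estimator_`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller grid than above to keep the sketch quick (illustrative values)
small_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVC(), small_grid, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is already refit on all of X_train
test_accuracy = grid.best_estimator_.score(X_test, y_test)
print(f"Best parameters: {grid.best_params_}")
print(f"Held-out test accuracy: {test_accuracy:.2f}")
```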

Scikit-Learn vs TensorFlow vs PyTorch

When choosing a machine learning library, it’s important to understand how scikit-learn compares to alternatives:

| Feature | Scikit-learn | TensorFlow | PyTorch |
| Primary focus | Classical ML algorithms | Deep learning | Deep learning |
| Learning curve | Gentle | Steep | Moderate |
| Performance with large data | Limited | Excellent | Excellent |
| GPU acceleration | Limited | Extensive | Extensive |
| Neural network support | Basic | Advanced | Advanced |
| Deployment | Simple | Production-ready | Research-friendly |
| Community size | Large | Very large | Large and growing |
| Ideal use cases | Classical ML, prototyping, tabular data | Production deep learning, deployment | Research, experimentation, flexibility |

When to choose Scikit-Learn

  • You’re learning machine learning fundamentals

  • You need quick prototyping with classical algorithms

  • Your dataset fits in memory

  • You want a consistent, simple API

  • You’re working with structured, tabular data

When to choose TensorFlow or PyTorch

  • TensorFlow: For production-ready deep learning, mobile/edge deployment, or when using TensorFlow Extended (TFX) ecosystem

  • PyTorch: For research-oriented projects, rapid experimentation, or when dynamic computational graphs are needed

Conclusion

Scikit-learn is an invaluable tool in any data scientist’s toolkit, offering a perfect balance of simplicity and power. Its consistent API, comprehensive documentation, and wide range of algorithms make it an excellent choice for beginners and experienced practitioners alike.

Whether you’re classifying emails, predicting stock prices, segmenting customers, or reducing dimensionality for visualization, scikit-learn provides the tools you need to build effective machine learning models in Python.

Ready to deepen your scikit-learn skills? Explore Codecademy’s Machine Learning with Python course, which covers scikit-learn in depth, from basic concepts to advanced techniques.

Frequently asked questions

1. Is scikit-learn better than TensorFlow?

Neither is inherently “better” – they serve different purposes. Scikit-learn excels at traditional machine learning algorithms with a simple, consistent API, making it ideal for beginners and for quickly prototyping models. TensorFlow specializes in deep learning and neural networks, offering more flexibility and computational power for complex models, especially those requiring GPU acceleration. Choose scikit-learn for classical machine learning tasks and TensorFlow for deep learning projects.

2. What is the difference between sklearn and scikit-learn?

There is no difference – “sklearn” is simply the abbreviation used in Python import statements for the scikit-learn library. When importing the library, you use import sklearn, but the full name of the project is “scikit-learn.” This naming convention follows Python’s import system requirements, where hyphens aren’t allowed in module names.
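
The two names can be seen side by side in a short sketch: the package is installed and looked up under its distribution name, `scikit-learn`, while the code imports the module `sklearn`. This assumes Python 3.8 or later for `importlib.metadata`:

```python
from importlib.metadata import version

import sklearn

# Distribution name (as used with pip) versus import name (as used in code)
installed = version("scikit-learn")
print("Installed distribution 'scikit-learn':", installed)
print("Imported module 'sklearn':", sklearn.__version__)
```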

3. What are the advantages of sklearn?

Scikit-learn offers several advantages:

  • Consistency: All algorithms follow the same API pattern

  • Comprehensive documentation: Extensive examples and tutorials

  • Integration: Works seamlessly with NumPy, Pandas, and matplotlib

  • Preprocessing tools: Robust toolset for data preparation

  • Model selection: Built-in cross-validation and hyperparameter tuning

  • Active community: Regular updates and responsive support

  • Low dependencies: Minimal external requirements beyond NumPy and SciPy

4. Is Keras better than sklearn?

Keras and scikit-learn serve different purposes and excel in different areas, so neither is better in general. Keras is a high-level neural networks API that runs on top of a backend such as TensorFlow, specializing in deep learning models like convolutional neural networks and recurrent neural networks. Scikit-learn focuses on traditional machine learning algorithms like decision trees, SVMs, and linear models. Choose Keras for deep learning projects and scikit-learn for classical machine learning tasks.

Codecademy Team
