
Understanding Convolutional Neural Network (CNN) Architecture

Published Mar 27, 2025 · Updated Mar 28, 2025
Learn how a convolutional neural network (CNN) works by understanding its components and architecture using examples.

We use artificial neural networks to build deep learning applications for tasks like image recognition, text classification, and speech recognition. For image classification on a small dataset with little variation among the training images, we can build a classifier using a feedforward neural network (FNN). However, feedforward neural networks are inefficient at handling large images and significant variations in the input. To build deep learning applications that can be trained efficiently on such datasets, we use convolutional neural network (CNN) models.

In this article, we will discuss convolutional neural networks (CNNs) and their advantages over feedforward neural network models for image classification tasks. To understand how a CNN model works, we will walk through its components with examples. We will also discuss the different types of CNN models.

What is a convolutional neural network (CNN)?

A convolutional neural network (CNN) is a type of neural network specifically used to build deep learning applications for image and video processing tasks. Using multiple convolutional layers, CNNs are designed to learn features such as edges, texture, color, and spatial orientation of the objects in the images. This makes CNNs efficient for tasks like image classification and object detection. A typical convolutional neural network model can be depicted using the following diagram:

CNN diagram

As shown in the diagram, a CNN model consists of the following components (a minimal code sketch follows the list):

  • Input layer: The input layer takes an image as its input. The input can be a 2D array representing pixel values for greyscale images or a 3D array representing color channels like RGB for colored images.
  • Convolutional layers: Convolutional layers are the specialty of CNN models. Each convolutional layer applies a set of filters to detect features in the input images and outputs feature maps.
  • Pooling layers: Pooling layers downsample the feature maps. Pooling reduces the size of the feature matrix and makes the CNN model invariant to minor distortions in the input images.
  • Fully connected layers: After multiple convolutional and pooling layers, we flatten the feature maps and pass them to fully connected layers for classification. The fully connected layers usually consist of dense layers with the ReLU activation function.
  • Output layer: The output layer consists of a dense layer with the softmax activation function for object detection or classification tasks.
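
To make this concrete, here is a minimal sketch of such an architecture in Keras. The input shape, filter counts, and layer sizes are illustrative assumptions, not fixed parts of the architecture:

```python
import tensorflow as tf

# A minimal sketch of the generic CNN architecture described above.
# The input shape, filter counts, and layer sizes are illustrative.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                       # input layer: 32x32 RGB image
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),   # convolutional layer
    tf.keras.layers.MaxPooling2D((2, 2)),                    # pooling layer
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                               # flatten the feature maps
    tf.keras.layers.Dense(64, activation="relu"),            # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),         # output layer: 10 classes
])
model.summary()
```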

CNNs differ from FNNs because of their convolutional and pooling layers. To understand how these layers shape a CNN model's behavior, we need to know how they work. Let's first discuss the different components of a convolutional layer.


Components of a convolutional layer

A convolutional layer consists of several components: the input tensor, filters, stride, padding, the activation function, and the output feature map. Let's discuss these components one by one.

Input tensor

The input to the first convolutional layer is the input image. The feature map from the previous convolutional or pooling layer is the input for the subsequent convolutional layers. The input tensor has three dimensions, i.e., height, width, and number of channels. For an RGB image of shape 32x32, the input has the shape 32x32x3, as the number of channels for an RGB image is 3. For greyscale images, the number of channels is 1.
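
As a quick illustration, these input shapes look as follows in NumPy (the 32x32 size is just an example):

```python
import numpy as np

# Input tensors have shape (height, width, channels).
rgb_image = np.zeros((32, 32, 3))    # RGB image: 3 channels
grey_image = np.zeros((32, 32, 1))   # greyscale image: 1 channel
print(rgb_image.shape, grey_image.shape)  # (32, 32, 3) (32, 32, 1)
```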

Filters or kernels

A filter or kernel is a small matrix of weights that scans over the input tensor to detect features. The weights of these filters are trainable parameters and are updated during the model training process. Generally, the shape of the filters is 3x3xC, 5x5xC, or 7x7xC, where C is the number of channels in the input. For example, the shape of kernels for processing RGB images will be 3x3x3, 5x5x3, or 7x7x3.
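
We can verify this in Keras, where a convolutional layer stores all of its filters in one trainable weight tensor. The filter count and input shape below are arbitrary examples:

```python
import tensorflow as tf

# 8 filters of shape 3x3x3 for a 3-channel (RGB) input.
layer = tf.keras.layers.Conv2D(filters=8, kernel_size=3)
layer.build(input_shape=(None, 32, 32, 3))
print(layer.kernel.shape)  # (3, 3, 3, 8): height x width x channels x filters
print(layer.bias.shape)    # (8,): one trainable bias per filter
```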

Stride

Stride defines the number of pixels a filter moves in each step during convolution.

Stride example

In this image, the arrows represent the starting position of the filter in each step. The red, green, and blue squares represent the filter positions in steps 1, 2, and 3, respectively. For stride 1, the filter moves from left to right by only one pixel. For stride 2, the filter moves two pixels at a time.

Padding

Depending on the input image size and the filter’s shape, pixels at the rightmost columns and bottom rows can be left out during convolution and pooling. For example, if we have an image with 12 columns and use a 3x3 filter with stride 2, as shown in the stride example, the rightmost column won’t be included in the calculations. Similarly, the bottom rows can also be excluded if the filter can’t fit onto those rows.

Sometimes, the excluded rows or columns contain significant features we don't want to lose. In such situations, padding adds extra pixels around the input image so that the filters can fit onto the outer rows and columns. The pixels added during padding carry no information (they are typically zeros).

For example, we can add one padding pixel to each side of the matrix with a 12x9 shape, making the shape 14x11, as shown below:

Padding example

After padding, we can process all the pixels from the original input matrix using a 3x3 filter with stride 2, ensuring no information is lost.
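
In general, the number of positions a filter can take along one dimension is floor((n + 2p - f) / s) + 1, where n is the input size, f the filter size, s the stride, and p the padding. A small sketch that reproduces the 12-column example above:

```python
import math

def conv_output_size(n, f, s, p=0):
    """Number of filter positions along one dimension:
    floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

# 12 columns, 3x3 filter, stride 2, no padding: the rightmost column is skipped.
print(conv_output_size(12, 3, 2))       # 5 positions, covering columns 1-11
# With one pixel of padding on each side, every original column is covered.
print(conv_output_size(12, 3, 2, p=1))  # 6 positions
```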

Activation function

Convolutional layers use activation functions to introduce non-linearity into their outputs after convolution. Usually, we use the ReLU activation function. ReLU replaces the negative values in the input matrix with zero, while non-negative values remain unchanged, as shown below:

ReLU example

In this example, you can observe that the input matrix contains some negative values. After applying the ReLU activation function, the negative values are replaced by zero.
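
ReLU is easy to reproduce in a few lines of NumPy (the matrix values here are arbitrary):

```python
import numpy as np

# ReLU replaces negative values with zero; non-negative values pass through.
feature_map = np.array([[ 3, -1,  2],
                        [-5,  0,  4],
                        [ 1, -2,  6]])
print(np.maximum(feature_map, 0))
# [[3 0 2]
#  [0 0 4]
#  [1 0 6]]
```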

Output feature map

After processing the input through the convolution operation and the ReLU activation function, the convolutional layer gives a feature map as output. If a convolutional layer has N filters, we get N feature maps as output.

We have discussed the different components of a convolutional layer. Now, let’s discuss how these components, combined with pooling and dense layers, facilitate image classification tasks.

How do convolutional neural network models work?

CNN models first process the input images using convolutional and pooling layers to get the feature maps. Then, the model flattens the feature maps and passes them through fully connected layers for image classification or object detection tasks. This whole process can be understood using the following steps:

Step 1: Process the input through the convolutional layer

When the model receives an array representing the input image, it performs the convolution operation. The model first places the filter at the top-left corner of the input image. It then multiplies the overlapping values in the filter and the image, takes their sum, and stores the result in a separate matrix. For instance, consider the following convolution example:

Convolution example

In this image, we have an input matrix of shape 12x9 and a filter of shape 3x3. The red square at the top-left corner of the input matrix denotes the filter's position at the start of the convolution operation. We multiply the overlapping values in the filter and the image and take their sum, as shown below:

1×1 + 0×2 + 1×1 + 2×3 + 1×0 + 2×(-2) + 3×1 + 2×3 + 1×2 = 15

Since we are performing convolution with stride 1, we slide the filter one column to the right (to the green square), multiply the overlapping values, and find their sum:

0×1 + 1×2 + 2×1 + 1×3 + 2×0 + 0×(-2) + 2×1 + 1×3 + 1×2 = 14

Again, we move the filter one column to the right and continue this operation. Once the calculation in the first row is done, we move the filter to the first column of the second row (blue square), multiply the overlapping values, and find their sum. After scanning this row, we start with the third row. This process continues until the entire input image is scanned.

After the convolution operation, we get a matrix representing the feature map of the particular image for the given filter. If we have N filters in a CNN layer, we perform the convolution of each filter with the input and get N feature maps.

The convolutional layer then applies the ReLU activation function to each feature map, converting the negative values to zero.
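
The scanning procedure described above can be sketched in a few lines of NumPy. This is a simplified, single-channel illustration with arbitrary input and filter values, not how deep learning frameworks implement convolution in practice:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each position, multiply the
    overlapping values and store their sum (valid convolution, no padding)."""
    h, w = image.shape
    fh, fw = kernel.shape
    out = np.zeros(((h - fh) // stride + 1, (w - fw) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = image[i * stride:i * stride + fh,
                           j * stride:j * stride + fw]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.random.randint(0, 4, size=(9, 12))  # a toy 9x12 input matrix
kernel = np.array([[1, 2, 1],
                   [3, 0, -2],
                   [1, 3, 2]])                 # an arbitrary 3x3 filter
feature_map = np.maximum(convolve2d(image, kernel), 0)  # convolution + ReLU
print(feature_map.shape)  # (7, 10) for stride 1
```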

Step 2: Downsampling (pooling)

To reduce the spatial dimensions of the feature maps after convolution, we use a downsampling technique called max pooling. In max pooling, we scan the feature map with a pooling window of shape 2x2 or 3x3 and select the maximum value in the pooling window region.

The model first places the pooling window at the top-left corner of the feature map. It then stores the maximum value within the window in a separate array. To understand this, consider the following example:

Max pooling example

In this image, we have an input matrix of shape 12x9 and a pooling window of shape 3x3. The red square at the top left of the input matrix denotes the position of the pooling window at the start of the max pooling operation. We take the maximum value in the red square region, i.e., 26, and store it in a separate matrix.

If we want to perform max pooling with stride 2, we slide the pooling window to the right by two columns (to the green square) and select the maximum value in the green region, i.e., 22. Next, we move the window two columns to the right (to the indigo square) and take the region's maximum value, i.e., 23. Once the calculation in the first row is done, we move the pooling window to the first column of the third row (blue square) and take the maximum value in the pooling window. After scanning this row, we start with the fifth row. This process continues until the entire feature map is scanned.

As you can see, max pooling reduces the spatial dimension of the feature map, resulting in computational efficiency. It also helps the model preserve sharp features like edges and textures and removes unnecessary details by taking the maximum value in each window. Since pooling removes unnecessary details, it also acts as a regularization technique, leading to better generalization and making the CNN model robust to minor spatial variations.
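
Here is a minimal NumPy sketch of max pooling with a 3x3 window and stride 2, mirroring the example above (the input values are arbitrary):

```python
import numpy as np

def max_pool2d(feature_map, size=3, stride=2):
    """Slide a pooling window over the feature map and keep only the
    maximum value in each window region."""
    h, w = feature_map.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = region.max()
    return out

fmap = np.random.randint(0, 30, size=(9, 12))  # a toy 9x12 feature map
print(max_pool2d(fmap).shape)  # (4, 5): spatial dimensions roughly halved
```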

In a CNN model, there can be any number of convolutional and pooling layers. Each convolutional layer takes the output feature map from the previous layer, performs the convolution operation, and passes the resultant feature map to the next layer.

Step 3: Flatten the feature map and perform image classification or object detection

After multiple convolution and pooling operations, the feature maps are flattened and passed to fully connected dense layers for image classification or object detection tasks. Finally, we get the CNN model’s output from the output layer.

To put everything together, let’s discuss how a CNN model with three convolutional layers performs the digit classification task.

CNN model working example

We can understand how a CNN model works using the following model for digit classification:

Digit classification with CNN example

In this model (a Keras sketch follows this list):

  • The first convolutional layer takes the image of the digit as its input and performs convolution with different filters to detect features like edges, corners, and blobs. After applying the ReLU activation function, the convolutional layer passes the feature maps to the pooling layer, which performs the max pooling operation on the feature maps and passes the output to the next convolutional layer.
  • The second convolutional layer performs convolution with filters to detect features like strokes, curves, and intersections. After applying the ReLU activation function, it passes the feature maps to the next pooling layer, which performs the max pooling operation on them and passes them to the third convolutional layer.
  • The third convolutional layer performs convolution with different filters to detect more abstract features like parts of digits. After applying the ReLU activation function, it passes the feature maps to the next pooling layer, which performs max pooling on the feature maps.
  • After the third pooling layer, all the feature maps are flattened and converted into a 1D array. The array is then given as input to the fully connected dense layers for the digit classification task.
  • Finally, the output layer gives the probability of an image being a particular digit as its output.
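
Here is a minimal Keras sketch of this digit classifier. The 28x28 greyscale input size and the filter counts (16, 32, and 64) are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),  # greyscale digit image
    # First convolutional layer: low-level features (edges, corners, blobs)
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Second convolutional layer: strokes, curves, intersections
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Third convolutional layer: more abstract features (parts of digits)
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                        # feature maps -> 1D array
    tf.keras.layers.Dense(64, activation="relu"),     # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),  # probabilities for digits 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```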

So far, we have discussed the different components of a convolutional layer and how a CNN model performs image classification or object detection tasks. We could instead build an image classifier using a feedforward neural network model. So why do we need a convolutional neural network?

Let’s discuss some of the reasons why CNN layers are best suited for image classification tasks.

Advantages of CNNs over FNNs

Suppose we have to build an image classification model for a dataset of colored 1080p images of shape 1920x1080. Each pixel position carries three values, one per RGB channel. If we use a feedforward neural network model to classify the images in the dataset, the input layer will have 1920 × 1080 × 3 = 6,220,800 neurons.

Now, even if we have only 256, 128, and 64 neurons in the three hidden layers and 10 neurons in the output layer, we will have a neural network containing almost 1.6 billion parameters to train. Even if we choose only two hidden layers with 128 and 64 neurons, we will still have to train roughly 800 million parameters, which is not feasible. Hence, we use CNNs to build image classification models.
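
We can check these numbers with a few lines of arithmetic (weight matrices only; biases add a negligible amount):

```python
inputs = 1920 * 1080 * 3  # 6,220,800 input values per image
fnn_params = inputs * 256 + 256 * 128 + 128 * 64 + 64 * 10
print(f"{fnn_params:,}")  # 1,592,566,400: almost 1.6 billion

# A convolutional layer's parameter count is independent of the image size:
conv_params = 32 * (3 * 3 * 3 + 1)  # 32 filters of shape 3x3x3, plus biases
print(conv_params)                  # 896
```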

CNNs have multiple advantages over simple feedforward neural networks for image classification tasks, as discussed below:

  • Reduced number of parameters: Feedforward neural network models use fully connected layers, leading to a huge number of parameters. Due to this, FNN models become inefficient for high-dimensional inputs. Convolutional layers in a CNN model use shared weights in the form of kernels/filters to scan different parts of an image to detect features. This leads to a massive reduction in the number of parameters in the model, resulting in computational efficiency.
  • Translational invariance: If the position of objects in the inputs changes, FNN models cannot process them accurately. This is because FNN models process images as a flattened vector and cannot detect spatial features in the images. CNNs can recognize patterns in an image regardless of their position in the image, as the filters in the convolutional layers scan the entire image to detect features.
  • Spatial hierarchy and locality: FNNs treat all the pixels in the input images equally, which makes them unable to recognize local features like edges and textures. CNNs preserve spatial relationships between the pixels using convolutional layers to capture local patterns.
  • Dimensionality Reduction: CNNs use pooling layers to reduce the feature map size while preserving dominant patterns. This prevents overfitting and improves computational efficiency.
  • Generalization: To detect patterns in rotated or transformed images using FNN models, we need extensive data augmentation and manual feature engineering. In contrast, the hierarchical feature extraction in CNNs makes the models robust to variations in the input images and saves the time spent on manual feature engineering.

We have discussed the generic architecture of CNN models, how they work, and their advantages over FNNs. For specific use cases, we can use different types of CNNs. Let's discuss a few of the most common ones.

Types of convolutional neural networks (CNNs)

Based on their use cases and architectures, CNNs can be classified into multiple types (a short example of loading pretrained models follows the list):

  • LeNet: LeNet was designed for handwritten digit recognition. It has a total of five layers, two of which are convolutional. LeNet uses average pooling for downsampling and the sigmoid/tanh activation functions in its convolutional layers.
  • AlexNet: AlexNet has five convolutional layers and three pooling layers, followed by fully connected layers. It uses overlapping max pooling for downsampling and the ReLU activation function in its convolutional layers. It also popularized dropout as a regularization technique to reduce overfitting.
  • ResNet: ResNet was built for large-scale image recognition tasks. It can be trained with a very large number of convolutional layers thanks to skip connections that let the model learn residual functions. The skip connections also mitigate the vanishing gradient problem by allowing gradients to flow directly through the network.
  • VGGNet: VGGNet stacks 3×3 convolutional filters to increase depth, creating a deep and uniform structure. Due to its large number of parameters, it has a very high training cost.
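
Several of these architectures ship with Keras as pretrained models, so you rarely need to build them from scratch. For example, ResNet50 and VGG16 are the variants available in tf.keras.applications (downloading the ImageNet weights requires an internet connection):

```python
import tensorflow as tf

# Load pretrained models with ImageNet weights.
resnet = tf.keras.applications.ResNet50(weights="imagenet")
vgg = tf.keras.applications.VGG16(weights="imagenet")
print(len(resnet.layers), len(vgg.layers))
```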

Conclusion

CNNs are excellent tools for object detection and image classification tasks. They have huge applications in autonomous driving, medical imaging, and IoT devices like AI-powered cameras that use facial recognition. In this article, we discussed the architecture and workings of a CNN model using examples. We also discussed the different types of CNN models and why CNNs are best suited for image classification and object detection tasks.

To learn more about CNNs, you can take this course on building an image classification model using CNN. You can also go through this skill path on building deep learning models with TensorFlow, which will help you create, train, and test deep learning models using TensorFlow and understand real-world applications of deep learning.

Happy learning!

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
