MongoDB Aggregation Stages Explained
While analyzing data, we often need to filter, sort, and aggregate it to calculate metrics like sum, count, and average for different values in the dataset. For example, let’s say we want information about the average salary of employees in a particular department of an organization.
To solve this problem, we need to filter data, calculate the average salaries, and sort the data as required. We can perform these operations using different stages in MongoDB aggregation pipelines.
To fully grasp the material covered in this tutorial, you should be familiar with the basics of MongoDB. For this, you can learn the basics of MongoDB with our free course.
Let’s discuss MongoDB aggregation stages such as $match
, $group
, $project
, etc. and how we can use them to extract insights from data.
Dataset Overview and Preparation
To understand how the different stages in MongoDB aggregation pipelines work, we will use a sample employee dataset that contains the following fields:
- employee ID
- employee name
- department ID
- salary
- employee age
To prepare the dataset, we start the MongoDB shell interface using the mongosh
command in a command-line application. We will get a test
prompt like the one shown below:
test>
For our experiments, we will create a new database named codecademy
. For this, we can execute the command use codecademy
.
test> use codecademy;switched to db codecademycodecademy>
After creating the database, we create a collection named employees
that will store the data for the employees using the db.createCollection()
function. The createCollection()
function takes the collection name as its input argument and creates a new collection in the current dataset.
codecademy> db.createCollection("employees");
Next, the data will be inserted into the collection using the insertMany()
method, as shown below:
codecademy> db.employees.insertMany([{ "emp_id": 1, "emp_name": "Ankit", "dept_id": 100, "salary": 10000, "emp_age": 32 }, { "emp_id": 2, "emp_name": "Aditya", "dept_id": 100, "salary": 12000, "emp_age": 25 }, { "emp_id": 3, "emp_name": "Adam", "dept_id": 100, "salary": 13000, "emp_age": 27 }, { "emp_id": 4, "emp_name": "Katy", "dept_id": 100, "salary": 12000, "emp_age": 23 }, { "emp_id": 5, "emp_name": "Akshat", "dept_id": 200, "salary": 15000, "emp_age": 21 }, { "emp_id": 6, "emp_name": "Jasmine", "dept_id": 200, "salary": 14000, "emp_age": 24 }, { "emp_id": 7, "emp_name": "Samantha", "dept_id": 200, "salary": 16000, "emp_age": 27 }, { "emp_id": 8, "emp_name": "Cole", "dept_id": 300, "salary": 20000, "emp_age": 36 }, { "emp_id": 9, "emp_name": "Kunal", "dept_id": 300, "salary": 25000, "emp_age": 39 }, { "emp_id": 10, "emp_name": "Joe", "dept_id": 400, "salary": 15000, "emp_age": 25 }])
After executing the insertMany()
function, we get an output showing that the documents are inserted successfully.
Now that the dataset is ready, let’s understand how each stage in MongoDB aggregation pipelines works.
How do MongoDB Aggregation Stages Work?
MongoDB aggregation pipelines consist of one or more stages that perform specific operations on documents.
A particular stage in the MongoDB pipeline filters data, groups documents, sorts them, or selects specific fields from the documents. The most common stages of MongoDB aggregation pipelines are $match
, $group
, $sort
, and $project
. To combine different stages together and execute the MongoDB pipelines, we use the aggregate()
function.
The aggregate() Function
The aggregate()
function allows us to utilize various aggregation stages and operators to derive insights from the data. It has the following syntax:
db.collection_name.aggregate(
[
stage 1,
stage 2,
stage 3,
…….
stage N
]
)
In the above syntax,
- The keyword
db
refers to the current database. collection_name
is the MongoDB collection in which the data is stored.- Stages 1, 2, 3, and N are stages like
$match
,$group
,$sort
, etc. - Every stage, after its execution, returns a document.
- Stage 1 reads data from the
collection_name
collection as input. The output document from each stage is used as input for the next stage. The final stage’s output is considered the output of the entire aggregation pipeline.
In the aggregate()
function, each stage can use zero or more aggregation operators, such as $sum
, $count
, etc., as required. Except for a few particular stages, a specific stage can be used multiple times in an aggregation pipeline.
How do the different stages and the aggregate()
function work? To understand this, we will discuss the different stages individually.
Different Stages in MongoDB Aggregation Pipelines
MongoDB aggregation pipelines consist of various stages, such as $match
, $group
, $sort
, and $project
.
The $match
Stage
The $match
stage is used to filter data from a MongoDB collection. To use the $match
stage, we use the following syntax.
db.collection_name.aggregate(
[{
$match:{
field_name1:value,
field_name2:value
}
}])
In the above syntax,
- The document passed to the
$match
stage can contain zero or more field names and their desired values. - When we specify the field names and their values in the document passed to the
$match
stage, only the documents with the same values as the input fields are returned as output. If no field names exist in the document, we get all the documents in thecollection_name
collection as the output.
To understand this stage, let’s find all the documents in the employees
collection with the dept_id
value as 300
. As shown below, we will pass the document {"dept_id":300}
to the $match
stage:
codecademy> db.employees.aggregate([{$match:{"dept_id":300}}])
After executing the above code, we get the following output:
[{_id: ObjectId('6636483d3e7fbdb6982202df'),emp_id: 8,emp_name: 'Cole',dept_id: 300,salary: 20000,emp_age: 36},{_id: ObjectId('6636483d3e7fbdb6982202e0'),emp_id: 9,emp_name: 'Kunal',dept_id: 300,salary: 25000,emp_age: 39}]
We can also add multiple field names in the document passed to the $match
stage. For instance, to get all the documents with the dept_id
100 and salary
12000, we will pass the document {"dept_id":100, "salary":12000}
to the $match
stage:
codecademy> db.employees.aggregate([{$match:{"dept_id":100,"salary":12000}}])
The output for the above code is as follows:
[{_id: ObjectId('6636483d3e7fbdb6982202d9'),emp_id: 2,emp_name: 'Aditya',dept_id: 100,salary: 12000,emp_age: 25},{_id: ObjectId('6636483d3e7fbdb6982202db'),emp_id: 4,emp_name: 'Katy',dept_id: 100,salary: 12000,emp_age: 23}]
Apart from filtering data, we might also need to group data using values in specific fields such as age, department, and others. We use the $group
stage in the MongoDB aggregation pipeline for this.
The $group
Stage
We use the $group
stage to group documents based on specific data fields and calculate aggregate values such as sum, average, min, or max for one or more fields.
The syntax for the $group
stage is as follows:
db.collection_name.aggregate(
[
{
$group:{
_id:{
field_name1:"$field_name1",
field_name2:"$field_name2"
},
new_value={ $operator:"$fieldname"}
}
}
])
In the above syntax,
- We pass a document containing any number of field names to the
_id
key. After execution, the$group
stage returns all the documents in the given collection grouped by the values in the provided fields. - When multiple field names are present in the document passed to the
_id
key, each unique set of values in the given field names is considered a separate group. - When we pass the value
null
to the_id
key, the$group
stage outputs all the documents in a single group. new_value
is calculated for each group.- The
$operator
is an aggregation operator like$sum
,$count
,$avg
, etc., andfieldname
is a field name from documents. Using the aggregation operators with the$group
stage is optional.
To understand how the $group
stage works, let’s use the given data to count the number of employees in each department.
For this, we will group the documents in the employees
collection by the dept_id
field by passing the document { dept_id:"$dept_id" }
to the _id
key in the $group
stage.
To count the number of employees in each department, we will use the $count
operator. The $count
operator takes an empty document {}
as its input.
codecademy> db.employees.aggregate([{$group: {_id: {dept_id:"$dept_id"}, number_of_employees:{$count:{}}}}])
The output from the above command is the following:
[{ _id: { dept_id: 100 }, number_of_employees: 4 },{ _id: { dept_id: 200 }, number_of_employees: 3 },{ _id: { dept_id: 400 }, number_of_employees: 1 },{ _id: { dept_id: 300 }, number_of_employees: 2 }]
In the output, you can see that we got four documents, each containing the department ID and the number of employees in that department.
We can also use multiple fields to group the data. For example, to count the number of employees that get a particular salary in each department, we can pass the document { dept_id:"$dept_id", salary:"$salary" }
to the id_
key in the $group
stage:
codecademy> db.employees.aggregate([{$group: {_id: {dept_id:"$dept_id",salary:"$salary"}, number_of_employees:{$count:{}}}}])
Output:
[{ _id: { dept_id: 200, salary: 15000 }, number_of_employees: 1 },{ _id: { dept_id: 200, salary: 16000 }, number_of_employees: 1 },{ _id: { dept_id: 100, salary: 13000 }, number_of_employees: 1 },{ _id: { dept_id: 200, salary: 14000 }, number_of_employees: 1 },{ _id: { dept_id: 300, salary: 25000 }, number_of_employees: 1 },{ _id: { dept_id: 400, salary: 15000 }, number_of_employees: 1 },{ _id: { dept_id: 300, salary: 20000 }, number_of_employees: 1 },{ _id: { dept_id: 100, salary: 10000 }, number_of_employees: 1 },{ _id: { dept_id: 100, salary: 12000 }, number_of_employees: 2 }]
The output shows a document for each unique pair of dept_id
and salary
field values. Hence, the documents are grouped according to the unique set of values in these fields.
Apart from grouping, we might also need to show the output data in a particular order. For this, we use the $sort
stage in MongoDB aggregation pipelines.
The $sort
Stage
The $sort
stage is used to sort the documents by given fields. We use the following syntax for the $sort
stage in an aggregation pipeline;
db.collection_name.aggregate(
[{
$sort:{
field_name1:1,
field_name2:-1
}
}]
)
In the above syntax,
- The document passed to the
$sort
stage contains field names from the dataset as keys with 1 or -1 as their values. - We set 1 as the value for a field name to sort the documents by the field in ascending order.
- To sort the documents in descending order by a given field, we set the value to -1 for the field name.
- When multiple field names are present in the document passed to the
$sort
stage, the output documents from thecollection_name
collection are sorted in the same order as the field names. - For the above syntax, the documents are first sorted by the
field_name1
field in ascending order. If two documents have the same value in thefield_name1
field, they are sorted byfield_name2
field in descending order.
To understand this, we will sort the documents in the employees
collection by salary
in ascending order. To do this, we will pass the document {salary:1}
to the $sort
stage like so:
codecademy> db.employees.aggregate([{$sort: {salary:1}}])
Output:
[{_id: ObjectId('6636483d3e7fbdb6982202d8'),emp_id: 1,emp_name: 'Ankit',dept_id: 100,salary: 10000,emp_age: 32},{_id: ObjectId('6636483d3e7fbdb6982202d9'),emp_id: 2,emp_name: 'Aditya',dept_id: 100,salary: 12000,emp_age: 25},....]
In the output, the documents are presented in the ascending order of their salaries.
We can also sort the documents by two field names. To sort the documents by the salary
field in ascending order and the age
field in descending order, we can pass the document {salary:1, age:-1}
:
codecademy> db.employees.aggregate([{$sort: {salary:1, age:-1}}])
Output:
[{_id: ObjectId('6636483d3e7fbdb6982202d8'),emp_id: 1,emp_name: 'Ankit',dept_id: 100,salary: 10000,emp_age: 32},{_id: ObjectId('6636483d3e7fbdb6982202d9'),emp_id: 2,emp_name: 'Aditya',dept_id: 100,salary: 12000,emp_age: 25},{_id: ObjectId('6636483d3e7fbdb6982202db'),emp_id: 4,emp_name: 'Katy',dept_id: 100,salary: 12000,emp_age: 23},......]
Note the documents are sorted by the salary
field in ascending order in the output. If two documents have the same value in the salary
field, they are sorted by the age
field in descending order.
Until now, the stages’ outputs contain all the fields in the documents. In most cases, we only want to access some fields stored in the data. In such situations, the $project
stage comes to our rescue.
The $project
Stage
In a MongoDB aggregation pipeline, we use the $project
stage to retrieve selected fields from given documents. It has the following syntax:
db.collection_name.aggregate(
[
{
$project:{
field_name1:1,
field_name2:1
}
}
])
In the above syntax,
- To retrieve a particular field from the documents, we set the value to 1 for the corresponding field name. To omit a particular field from the output, we set the value to 0.
- If a field name isn’t mentioned in the document passed to the
$project
stage, the field is not included in the output documents by default.
To understand this, let’s select the names of all the employees in the employees
collection. For this, we will pass the document {emp_name:1}
to the $project
stage, as shown below:
codecademy> db.employees.aggregate([{$project: {emp_name:1}}])
Output:
[{ _id: ObjectId('6636483d3e7fbdb6982202d8'), emp_name: 'Ankit' },{ _id: ObjectId('6636483d3e7fbdb6982202d9'), emp_name: 'Aditya' },{ _id: ObjectId('6636483d3e7fbdb6982202da'), emp_name: 'Adam' },{ _id: ObjectId('6636483d3e7fbdb6982202db'), emp_name: 'Katy' },{ _id: ObjectId('6636483d3e7fbdb6982202dc'), emp_name: 'Akshat' },{ _id: ObjectId('6636483d3e7fbdb6982202dd'), emp_name: 'Jasmine' },{ _id: ObjectId('6636483d3e7fbdb6982202de'), emp_name: 'Samantha' },{ _id: ObjectId('6636483d3e7fbdb6982202df'), emp_name: 'Cole' },{ _id: ObjectId('6636483d3e7fbdb6982202e0'), emp_name: 'Kunal' },{ _id: ObjectId('6636483d3e7fbdb6982202e1'), emp_name: 'Joe' }]
To select multiple fields, we can pass all the field names to the $project
stage. For instance, we can pass the document {emp_name:1, salary:1}
to the $project
stage to select the name
and salary
of the employees.
codecademy> db.employees.aggregate([{$project: {emp_name:1,salary:1}}])
Output:
[{_id: ObjectId('6636483d3e7fbdb6982202d8'),emp_name: 'Ankit',salary: 10000},{_id: ObjectId('6636483d3e7fbdb6982202d9'),emp_name: 'Aditya',salary: 12000},......]
By default, we always get the document’s ID in the output. To exclude this, we can set the _id
key to 0 in the document passed to the $project
stage.
codecademy> db.employees.aggregate([{$project: {emp_name:1,salary:1,_id:0}}])
Notice that we don’t get the _id
field in the output after setting _id
to 0:
[{ emp_name: 'Ankit', salary: 10000 },{ emp_name: 'Aditya', salary: 12000 },{ emp_name: 'Adam', salary: 13000 },{ emp_name: 'Katy', salary: 12000 },{ emp_name: 'Akshat', salary: 15000 },{ emp_name: 'Jasmine', salary: 14000 },{ emp_name: 'Samantha', salary: 16000 },{ emp_name: 'Cole', salary: 20000 },{ emp_name: 'Kunal', salary: 25000 },{ emp_name: 'Joe', salary: 15000 }]
Conclusion
Understanding the different stages in MongoDB aggregation pipelines can help you easily solve analytical questions. Here are some key takeaways from this article:
- MongoDB aggregation pipelines consist of one or more stages that perform specific operations on documents.
- A particular stage in the MongoDB pipeline filters data, groups documents, sorts them, or selects specific fields from the documents.
- To combine different stages together and execute the MongoDB pipelines, we use the
aggregate()
function.
To learn more about solving analytical problems using the discussed concepts, you can read this article on MongoDB aggregation pipelines.
Author
The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
Meet the full team