MongoDB Aggregation Stages Explained

Codecademy Team
Learn MongoDB Aggregation Stages

While analyzing data, we often need to filter, sort, and aggregate it to calculate metrics like sum, count, and average for different values in the dataset. For example, let’s say we want information about the average salary of employees in a particular department of an organization.

To solve this problem, we need to filter data, calculate the average salaries, and sort the data as required. We can perform these operations using different stages in MongoDB aggregation pipelines.

To fully grasp the material covered in this tutorial, you should be familiar with the basics of MongoDB. For this, you can learn the basics of MongoDB with our free course.

Let’s discuss MongoDB aggregation stages such as $match, $group, $project, etc. and how we can use them to extract insights from data.

Dataset Overview and Preparation

To understand how the different stages in MongoDB aggregation pipelines work, we will use a sample employee dataset that contains the following fields:

  • employee ID
  • employee name
  • department ID
  • salary
  • employee age

To prepare the dataset, we start the MongoDB shell interface using the mongosh command in a command-line application. We will get a test prompt like the one shown below:

test>

For our experiments, we will create a new database named codecademy. For this, we can execute the command use codecademy.

test> use codecademy;
switched to db codecademy
codecademy>

After creating the database, we create a collection named employees that will store the data for the employees using the db.createCollection() function. The createCollection() function takes the collection name as its input argument and creates a new collection in the current dataset.

codecademy> db.createCollection("employees");

Next, the data will be inserted into the collection using the insertMany() method, as shown below:

codecademy> db.employees.insertMany([{ "emp_id": 1, "emp_name": "Ankit", "dept_id": 100, "salary": 10000, "emp_age": 32 }, { "emp_id": 2, "emp_name": "Aditya", "dept_id": 100, "salary": 12000, "emp_age": 25 }, { "emp_id": 3, "emp_name": "Adam", "dept_id": 100, "salary": 13000, "emp_age": 27 }, { "emp_id": 4, "emp_name": "Katy", "dept_id": 100, "salary": 12000, "emp_age": 23 }, { "emp_id": 5, "emp_name": "Akshat", "dept_id": 200, "salary": 15000, "emp_age": 21 }, { "emp_id": 6, "emp_name": "Jasmine", "dept_id": 200, "salary": 14000, "emp_age": 24 }, { "emp_id": 7, "emp_name": "Samantha", "dept_id": 200, "salary": 16000, "emp_age": 27 }, { "emp_id": 8, "emp_name": "Cole", "dept_id": 300, "salary": 20000, "emp_age": 36 }, { "emp_id": 9, "emp_name": "Kunal", "dept_id": 300, "salary": 25000, "emp_age": 39 }, { "emp_id": 10, "emp_name": "Joe", "dept_id": 400, "salary": 15000, "emp_age": 25 }])

After executing the insertMany() function, we get an output showing that the documents are inserted successfully.

Now that the dataset is ready, let’s understand how each stage in MongoDB aggregation pipelines works.

How do MongoDB Aggregation Stages Work?

MongoDB aggregation pipelines consist of one or more stages that perform specific operations on documents.

A particular stage in the MongoDB pipeline filters data, groups documents, sorts them, or selects specific fields from the documents. The most common stages of MongoDB aggregation pipelines are $match, $group, $sort, and $project. To combine different stages together and execute the MongoDB pipelines, we use the aggregate() function.

The aggregate() Function

The aggregate() function allows us to utilize various aggregation stages and operators to derive insights from the data. It has the following syntax:

db.collection_name.aggregate(  
[  
  stage 1,  
  stage 2,   
  stage 3,  
  …….  
  stage N  
]  
)  

In the above syntax,

  • The keyword db refers to the current database.
  • collection_name is the MongoDB collection in which the data is stored.
  • Stages 1, 2, 3, and N are stages like $match, $group, $sort, etc.
  • Every stage, after its execution, returns a document.
  • Stage 1 reads data from the collection_name collection as input. The output document from each stage is used as input for the next stage. The final stage’s output is considered the output of the entire aggregation pipeline.

In the aggregate() function, each stage can use zero or more aggregation operators, such as $sum, $count, etc., as required. Except for a few particular stages, a specific stage can be used multiple times in an aggregation pipeline.

How do the different stages and the aggregate() function work? To understand this, we will discuss the different stages individually.

Different Stages in MongoDB Aggregation Pipelines

MongoDB aggregation pipelines consist of various stages, such as $match, $group, $sort, and $project.

The $match Stage

The $match stage is used to filter data from a MongoDB collection. To use the $match stage, we use the following syntax.

db.collection_name.aggregate(  

[{  
  $match:{  
    field_name1:value,  
    field_name2:value  
  }  
}])  

In the above syntax,

  • The document passed to the $match stage can contain zero or more field names and their desired values.
  • When we specify the field names and their values in the document passed to the $match stage, only the documents with the same values as the input fields are returned as output. If no field names exist in the document, we get all the documents in the collection_name collection as the output.

To understand this stage, let’s find all the documents in the employees collection with the dept_id value as 300. As shown below, we will pass the document {"dept_id":300} to the $match stage:

codecademy> db.employees.aggregate([{$match:{"dept_id":300}}])

After executing the above code, we get the following output:

[
{
_id: ObjectId('6636483d3e7fbdb6982202df'),
emp_id: 8,
emp_name: 'Cole',
dept_id: 300,
salary: 20000,
emp_age: 36
},
{
_id: ObjectId('6636483d3e7fbdb6982202e0'),
emp_id: 9,
emp_name: 'Kunal',
dept_id: 300,
salary: 25000,
emp_age: 39
}
]

We can also add multiple field names in the document passed to the $match stage. For instance, to get all the documents with the dept_id 100 and salary 12000, we will pass the document {"dept_id":100, "salary":12000} to the $match stage:

codecademy> db.employees.aggregate([{$match:{"dept_id":100,"salary":12000}}])

The output for the above code is as follows:

[
{
_id: ObjectId('6636483d3e7fbdb6982202d9'),
emp_id: 2,
emp_name: 'Aditya',
dept_id: 100,
salary: 12000,
emp_age: 25
},
{
_id: ObjectId('6636483d3e7fbdb6982202db'),
emp_id: 4,
emp_name: 'Katy',
dept_id: 100,
salary: 12000,
emp_age: 23
}
]

Apart from filtering data, we might also need to group data using values in specific fields such as age, department, and others. We use the $group stage in the MongoDB aggregation pipeline for this.

The $group Stage

We use the $group stage to group documents based on specific data fields and calculate aggregate values such as sum, average, min, or max for one or more fields.

The syntax for the $group stage is as follows:

db.collection_name.aggregate(  
[  
{  
 $group:{  
  _id:{  
    field_name1:"$field_name1",  
    field_name2:"$field_name2"	  
  },  
    new_value={ $operator:"$fieldname"}  
  }  
}  
])  

In the above syntax,

  • We pass a document containing any number of field names to the _id key. After execution, the $group stage returns all the documents in the given collection grouped by the values in the provided fields.
  • When multiple field names are present in the document passed to the _id key, each unique set of values in the given field names is considered a separate group.
  • When we pass the value null to the _id key, the $group stage outputs all the documents in a single group.
  • new_value is calculated for each group.
  • The $operator is an aggregation operator like $sum, $count, $avg, etc., and fieldname is a field name from documents. Using the aggregation operators with the $group stage is optional.

To understand how the $group stage works, let’s use the given data to count the number of employees in each department.

For this, we will group the documents in the employees collection by the dept_id field by passing the document { dept_id:"$dept_id" } to the _id key in the $group stage.

To count the number of employees in each department, we will use the $count operator. The $count operator takes an empty document {} as its input.

codecademy> db.employees.aggregate([{$group: {_id: {dept_id:"$dept_id"}, number_of_employees:{$count:{}}}}])

The output from the above command is the following:

[
{ _id: { dept_id: 100 }, number_of_employees: 4 },
{ _id: { dept_id: 200 }, number_of_employees: 3 },
{ _id: { dept_id: 400 }, number_of_employees: 1 },
{ _id: { dept_id: 300 }, number_of_employees: 2 }
]

In the output, you can see that we got four documents, each containing the department ID and the number of employees in that department.

We can also use multiple fields to group the data. For example, to count the number of employees that get a particular salary in each department, we can pass the document { dept_id:"$dept_id", salary:"$salary" } to the id_ key in the $group stage:

codecademy> db.employees.aggregate([{$group: {_id: {dept_id:"$dept_id",salary:"$salary"}, number_of_employees:{$count:{}}}}])

Output:

[
{ _id: { dept_id: 200, salary: 15000 }, number_of_employees: 1 },
{ _id: { dept_id: 200, salary: 16000 }, number_of_employees: 1 },
{ _id: { dept_id: 100, salary: 13000 }, number_of_employees: 1 },
{ _id: { dept_id: 200, salary: 14000 }, number_of_employees: 1 },
{ _id: { dept_id: 300, salary: 25000 }, number_of_employees: 1 },
{ _id: { dept_id: 400, salary: 15000 }, number_of_employees: 1 },
{ _id: { dept_id: 300, salary: 20000 }, number_of_employees: 1 },
{ _id: { dept_id: 100, salary: 10000 }, number_of_employees: 1 },
{ _id: { dept_id: 100, salary: 12000 }, number_of_employees: 2 }
]

The output shows a document for each unique pair of dept_id and salary field values. Hence, the documents are grouped according to the unique set of values in these fields.

Apart from grouping, we might also need to show the output data in a particular order. For this, we use the $sort stage in MongoDB aggregation pipelines.

The $sort Stage

The $sort stage is used to sort the documents by given fields. We use the following syntax for the $sort stage in an aggregation pipeline;

db.collection_name.aggregate(  
[{  
  $sort:{  
    field_name1:1,  
    field_name2:-1
  }  
}]  
)  

In the above syntax,

  • The document passed to the $sort stage contains field names from the dataset as keys with 1 or -1 as their values.
  • We set 1 as the value for a field name to sort the documents by the field in ascending order.
  • To sort the documents in descending order by a given field, we set the value to -1 for the field name.
  • When multiple field names are present in the document passed to the $sort stage, the output documents from the collection_name collection are sorted in the same order as the field names.
  • For the above syntax, the documents are first sorted by the field_name1 field in ascending order. If two documents have the same value in the field_name1 field, they are sorted by field_name2 field in descending order.

To understand this, we will sort the documents in the employees collection by salary in ascending order. To do this, we will pass the document {salary:1} to the $sort stage like so:

codecademy> db.employees.aggregate([{$sort: {salary:1}}])

Output:

[
{
_id: ObjectId('6636483d3e7fbdb6982202d8'),
emp_id: 1,
emp_name: 'Ankit',
dept_id: 100,
salary: 10000,
emp_age: 32
},
{
_id: ObjectId('6636483d3e7fbdb6982202d9'),
emp_id: 2,
emp_name: 'Aditya',
dept_id: 100,
salary: 12000,
emp_age: 25
},
....
]

In the output, the documents are presented in the ascending order of their salaries.

We can also sort the documents by two field names. To sort the documents by the salary field in ascending order and the age field in descending order, we can pass the document {salary:1, age:-1}:

codecademy> db.employees.aggregate([{$sort: {salary:1, age:-1}}])

Output:

[
{
_id: ObjectId('6636483d3e7fbdb6982202d8'),
emp_id: 1,
emp_name: 'Ankit',
dept_id: 100,
salary: 10000,
emp_age: 32
},
{
_id: ObjectId('6636483d3e7fbdb6982202d9'),
emp_id: 2,
emp_name: 'Aditya',
dept_id: 100,
salary: 12000,
emp_age: 25
},
{
_id: ObjectId('6636483d3e7fbdb6982202db'),
emp_id: 4,
emp_name: 'Katy',
dept_id: 100,
salary: 12000,
emp_age: 23
},
......
]

Note the documents are sorted by the salary field in ascending order in the output. If two documents have the same value in the salary field, they are sorted by the age field in descending order.

Until now, the stages’ outputs contain all the fields in the documents. In most cases, we only want to access some fields stored in the data. In such situations, the $project stage comes to our rescue.

The $project Stage

In a MongoDB aggregation pipeline, we use the $project stage to retrieve selected fields from given documents. It has the following syntax:

db.collection_name.aggregate( 
[  
{  
  $project:{  
    field_name1:1,  
    field_name2:1  
  }  
}  
])  

In the above syntax,

  • To retrieve a particular field from the documents, we set the value to 1 for the corresponding field name. To omit a particular field from the output, we set the value to 0.
  • If a field name isn’t mentioned in the document passed to the $project stage, the field is not included in the output documents by default.

To understand this, let’s select the names of all the employees in the employees collection. For this, we will pass the document {emp_name:1} to the $project stage, as shown below:

codecademy> db.employees.aggregate([{$project: {emp_name:1}}])

Output:

[
{ _id: ObjectId('6636483d3e7fbdb6982202d8'), emp_name: 'Ankit' },
{ _id: ObjectId('6636483d3e7fbdb6982202d9'), emp_name: 'Aditya' },
{ _id: ObjectId('6636483d3e7fbdb6982202da'), emp_name: 'Adam' },
{ _id: ObjectId('6636483d3e7fbdb6982202db'), emp_name: 'Katy' },
{ _id: ObjectId('6636483d3e7fbdb6982202dc'), emp_name: 'Akshat' },
{ _id: ObjectId('6636483d3e7fbdb6982202dd'), emp_name: 'Jasmine' },
{ _id: ObjectId('6636483d3e7fbdb6982202de'), emp_name: 'Samantha' },
{ _id: ObjectId('6636483d3e7fbdb6982202df'), emp_name: 'Cole' },
{ _id: ObjectId('6636483d3e7fbdb6982202e0'), emp_name: 'Kunal' },
{ _id: ObjectId('6636483d3e7fbdb6982202e1'), emp_name: 'Joe' }
]

To select multiple fields, we can pass all the field names to the $project stage. For instance, we can pass the document {emp_name:1, salary:1} to the $project stage to select the name and salary of the employees.

codecademy> db.employees.aggregate([{$project: {emp_name:1,salary:1}}])

Output:

[
{
_id: ObjectId('6636483d3e7fbdb6982202d8'),
emp_name: 'Ankit',
salary: 10000
},
{
_id: ObjectId('6636483d3e7fbdb6982202d9'),
emp_name: 'Aditya',
salary: 12000
},
......
]

By default, we always get the document’s ID in the output. To exclude this, we can set the _id key to 0 in the document passed to the $project stage.

codecademy> db.employees.aggregate([{$project: {emp_name:1,salary:1,_id:0}}])

Notice that we don’t get the _id field in the output after setting _id to 0:

[
{ emp_name: 'Ankit', salary: 10000 },
{ emp_name: 'Aditya', salary: 12000 },
{ emp_name: 'Adam', salary: 13000 },
{ emp_name: 'Katy', salary: 12000 },
{ emp_name: 'Akshat', salary: 15000 },
{ emp_name: 'Jasmine', salary: 14000 },
{ emp_name: 'Samantha', salary: 16000 },
{ emp_name: 'Cole', salary: 20000 },
{ emp_name: 'Kunal', salary: 25000 },
{ emp_name: 'Joe', salary: 15000 }
]

Conclusion

Understanding the different stages in MongoDB aggregation pipelines can help you easily solve analytical questions. Here are some key takeaways from this article:

  1. MongoDB aggregation pipelines consist of one or more stages that perform specific operations on documents.
  2. A particular stage in the MongoDB pipeline filters data, groups documents, sorts them, or selects specific fields from the documents.
  3. To combine different stages together and execute the MongoDB pipelines, we use the aggregate() function.

To learn more about solving analytical problems using the discussed concepts, you can read this article on MongoDB aggregation pipelines.