In this blog post, I will summarize the most useful facts about building and managing indexes in MongoDB. Indexes enable us to run queries much more efficiently, and they are an integral part of application development. It is very important for developers to know what indexes are and how to utilize them.

I have a lot of experience with MongoDB development. Despite that, I had been planning for a long time to refresh my knowledge about this very popular database. To do that, I picked up the book "MongoDB: The Definitive Guide" by Shannon Bradshaw. It's a great book, and a lot of what I write in this post comes from my notes from reading it.

How indexes work

An index is, simply put, a lookup map that tells us the exact location in the database of the entry we are looking for. Without it, we would have to go through the entire collection to find the data, which is obviously inefficient.

A useful comparison, as stated in the book, is the index of a book. It lets us go directly to the page we are interested in, without flipping through the entire book.

Seeding test data

First of all, we need some data to work with. I will create a test collection named users and seed it with one million documents using the following script:

// generates a random number between the given min and max (inclusive)
function numberBetween(min, max) {
  return Math.floor(Math.random() * (max - min + 1) + min);
}

const usersToCreate = [];
for (const index of [...Array(1000000).keys()]) {
  usersToCreate.push({
    name: `user_${index}`,
    email: `user_${index}@email.org`,
    age: NumberInt(numberBetween(18, 118)),
    isPremiumUser: Boolean(numberBetween(0, 1)), // random boolean
    about: `This is a hardcoded text, with a random ${index} index.`,
  });
}

db.users.insertMany(usersToCreate);

We can see in the script that a single user entry will have the following properties:

  • name: string
  • email: string
  • age: number
  • isPremiumUser: boolean
  • about: string

This data set will be used throughout the entire blog post.

Creating indexes

Indexes in MongoDB can be created in multiple ways, depending on what language or platform you are using. We are going to use the standard Mongo Shell.

db.users.createIndex({ about: 1 });
/*
{
  "numIndexesBefore" : 0,
  "numIndexesAfter" : 1,
  "createdCollectionAutomatically" : false,
  "ok" : 1.0
}
*/

This is how a simple index is created. The output tells us the number of indexes that existed for this collection before and after running this command.

Index types

MongoDB has several types of indexes. We will list them here:

  • single field indexes
  • compound indexes
  • multikey indexes
  • geospatial indexes
  • text indexes
  • hashed indexes

We will go through the first three types since they are the most frequently used, and describe them in more detail. We will skip the others since they are very specialized for certain use cases.

Single field indexes

Single field indexes are, in a way, the simplest of all the types listed above. ☝️ As the name says, they contain only a single key.

Let's first try a query on this newly created collection before creating any indexes:

db.getCollection('users')
  .find({ name: 'user_900000' })
  .explain('executionStats');
/*
{
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1,
    "executionTimeMillis" : 680,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 1000000,
    ...
  }
  ...
}
*/

We can see the output of this particular MongoDB query in JSON format. The two properties that are important for us now are totalDocsExamined and executionTimeMillis. If we pay attention to totalDocsExamined, we can see that the number is 1000000, i.e. the exact number of docs we inserted with the script listed above. ☝️ That means that MongoDB had to go through all of the entries in the users collection in order to retrieve the one we asked for. That is reflected in executionTimeMillis as well: on my machine the query took 680 milliseconds (this number will differ slightly depending on the machine). The totalKeysExamined property being 0 also tells us that no indexes were used in this query.

Ok, now let's create our single field index:

db.users.createIndex({ name: 1 });
/*
{
  "numIndexesBefore" : 1,
  "numIndexesAfter" : 2,
  "createdCollectionAutomatically" : false,
  "ok" : 1.0
}
*/

This output signifies that the index was created successfully. Now we can run the same query to find a single user and look at the output:

db.getCollection('users')
  .find({ name: 'user_900000' })
  .explain('executionStats');
/*
{
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1,
    "executionTimeMillis" : 3,
    "totalKeysExamined" : 1,
    "totalDocsExamined" : 1,
    ...
  }
  ...
}
*/

We've got much better results now! The executionTimeMillis has been reduced by multiple orders of magnitude, to just 3ms! We can also see that the totalDocsExamined is 1, meaning that only one entry has been accessed by MongoDB. That is possible because the index we have created was used. We can confirm that by looking at the totalKeysExamined property, which is now 1.

This example shows us how powerful indexes can be. The difference would be even larger if the data set were bigger.

Compound indexes

Our first index used only a single key. For many query patterns it is necessary to build indexes based on multiple keys. This type of index is called a compound index.

The next section will show a couple of clear benefits of using compound indexes.

Note: Before every example I will describe which index we should create. Because every following index is more efficient than the last, MongoDB will automatically choose it for the executed query. But if you would like to force an index manually, that can be done with the cursor.hint() command, as shown below.
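For illustration, here is a sketch of forcing a specific index (the single-key isPremiumUser index, assuming it has been created) instead of letting the query planner pick one:

db.users
  .find({ isPremiumUser: true, age: { $lte: 60 } })
  .hint({ isPremiumUser: 1 }) // force the single-key isPremiumUser index
  .explain('executionStats');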

First benefit - faster queries

The first reason for creating a compound index is simply to support a query that contains two or more keys. We will see that in the following scenario. Let's imagine that we want to query our users collection by two properties, isPremiumUser and age, and that there's already a single-key index for the isPremiumUser property. Let's create that index in the Mongo Shell: db.users.createIndex({ isPremiumUser: 1 });. Now if we run the query, we can see the following results:

db.users
  .find({ isPremiumUser: true, age: { $lte: 60 } })
  .explain('executionStats');
/*
{
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 212997,
    "executionTimeMillis" : 677,
    "totalKeysExamined" : 500293,
    "totalDocsExamined" : 500293,
    ...
  }
  ...
}
*/

The results show us that MongoDB examined around half of the entries of the newly created index, which we can see in the "totalKeysExamined" : 500293 property. That makes sense, because when we seeded the data the isPremiumUser property was set randomly to either true or false. But then MongoDB had to go and scan every document that matched the retrieved index entries, seen in the "totalDocsExamined" : 500293 property, in order to find all the users with an age less than or equal to 60. We can see this from the difference between the total docs examined (500293) and the total docs returned (212997). The execution time was good, but not great: 677 ms.

Let's create a compound index that will cover both of the keys passed in the query: db.users.createIndex({ isPremiumUser: 1, age: 1 });. Now we can run the same query again:

db.users
  .find({ isPremiumUser: true, age: { $lte: 60 } })
  .explain('executionStats');
/*
{
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 212997,
    "executionTimeMillis" : 444,
    "totalKeysExamined" : 212997,
    "totalDocsExamined" : 212997,
    ...
  }
  ...
}
*/

We can immediately see that the execution time is better: 444ms. Looking at the results, we can see that MongoDB could find all the docs that satisfy the query directly in the index; the only thing left to do afterwards was to fetch them. The matching values of "totalKeysExamined", "totalDocsExamined" and "nReturned" show us that. This is one benefit of using a compound index that covers all the keys of the query.

Second benefit - better sorts

The second important reason for using a compound index is to support sorting. Indexes can be very helpful for sorting, because their entries are already stored in sorted order.

Equality query + sorting

This subsection will describe the sorting behavior when we use equality queries.

Let's first execute an equality query with sorting, without having a compound index to support it. We already have a single-key index for the isPremiumUser property. When we run a query targeting this index, but with a sort at the end, we get this result:

db.users
  .find({ isPremiumUser: true })
  .sort({ email: 1 })
  .explain('executionStats');
/*
{
  ...
  "stage" : "SORT",
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 500293,
    "executionTimeMillis" : 2345,
    "totalKeysExamined" : 500293,
    "totalDocsExamined" : 500293,
    ...
  }
  ...
}
*/

Now let's add the following compound index: db.users.createIndex({ isPremiumUser: 1, email: 1 });. If we run the same query we will get the following:

db.users
  .find({ isPremiumUser: true })
  .sort({ email: 1 })
  .explain('executionStats');
/*
{
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 500293,
    "executionTimeMillis" : 685,
    "totalKeysExamined" : 500293,
    "totalDocsExamined" : 500293,
    ...
  }
  ...
}
*/

When we compare the two executions, we can see that the number of docs examined is the same, but the second query is around four times faster! Why is that? The second query was faster because MongoDB didn't have to do any in-memory sorting. Since the entries of the compound index are sorted first by the isPremiumUser key and then by the email key, MongoDB could simply fetch the documents in index order. The first query, on the other hand, shows us that after the fetching phase MongoDB needed to sort the retrieved data. This can be seen in the explain output by the "stage" : "SORT" entry.

Sorting in memory is time consuming, of course, and it gets even worse for larger sets of data. With the compound index in place, MongoDB could run the second query more efficiently and without any sorting.

Range query + sorting

This subsection will describe the sorting behavior when we use range queries. Here we will not cover the case where no compound index exists, since we have already seen how useful compound indexes are for queries with a sort.

Instead, we will analyze the order of the keys in compound indexes. This is especially important when using range queries.

We will now try a query with the age and email keys. Let's create the following index: db.users.createIndex({ age: 1, email: 1 });. The entries in this index will be sorted by the age property first, and by the email property second.

db.users
  .find({ age: { $gte: 50 } })
  .sort({ email: 1 })
  .explain('executionStats');
/*
{
  ...
  "stage" : "SORT",
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 683264,
    "executionTimeMillis" : 4836,
    "totalKeysExamined" : 683264,
    "totalDocsExamined" : 683264,
    ...
  }
  ...
}
*/

If we look at the execution stats, we can see that the number of (index) keys and docs examined matches the number of entries returned, which is great. The bad thing is that MongoDB had to do the sort in memory. That's the reason for the execution time of almost five seconds.

Now let's try a different order of the keys and create the following index: db.users.createIndex({ email: 1, age: 1 });. If we run the same query, MongoDB will automatically pick this newest index. Let's see the results:

db.users
  .find({ age: { $gte: 50 } })
  .sort({ email: 1 })
  .explain('executionStats');
/*
{
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 683264,
    "executionTimeMillis" : 1483,
    "totalKeysExamined" : 1000000,
    "totalDocsExamined" : 1000000,
    ...
  }
  ...
}
*/

One improvement we can immediately spot is that the in-memory sort didn't happen. But there is also a downside: MongoDB had to go through the entire index and examine every document in the collection in order to find the needed data. Still, the execution time is reduced by more than a factor of three.

There is a tradeoff here which cannot be avoided: in order to skip the in-memory sort, we had to examine more index keys (in fact, all of them) than the number of documents that are returned. But a general guideline is that if we can avoid an in-memory sort, we should, even if that means scanning more keys. That is the best solution most of the time, and it will allow us to scale more easily with respect to dataset size and system load.

The golden ESR (equality, sort, range) rule

Based on our analysis of compound indexes, this rule encompasses the best practices for building them. The rule is fairly simple: when we are building an index with multiple keys, we should order them as follows:

  1. keys for equality filters should appear first
  2. keys used for sorting should appear before multivalue fields
  3. keys for multivalue filters should appear last

Note: operators like $ne, $nin, $regex are range operators and fall into the third category.

Just to repeat, this will not be the answer for absolutely every dataset, but it will cover the majority of cases.
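As an illustrative sketch using our users collection (this particular index isn't created anywhere else in this post): for a query that filters isPremiumUser by equality, sorts by email, and filters age by range, the ESR rule tells us to order the keys like this:

// equality (isPremiumUser) first, sort (email) second, range (age) last
db.users.createIndex({ isPremiumUser: 1, email: 1, age: 1 });

db.users
  .find({ isPremiumUser: true, age: { $gte: 50 } })
  .sort({ email: 1 });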

Multikey indexes

This type of index is created automatically when one of the indexed keys is an array. We can observe this in the execution stats: isMultiKey is set to true when a multikey index is used. A multikey index can be either a single field index or a compound index.

What happens is that MongoDB creates an index key for each element of the array, which is a very handy feature. One downside is that this type of index can grow quickly if the indexed array gets large. Because of that, when building a compound index, only one of the indexed fields may be an array; if more than one contains an array, MongoDB will report an error.
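Our seeded users don't contain an array field, but as a hypothetical sketch, imagine that some documents had a tags array:

// add an array field to a document (hypothetical example)
db.users.updateOne(
  { name: 'user_1' },
  { $set: { tags: ['sports', 'music'] } },
);

// indexing the array field creates one index entry per array element,
// which automatically makes this a multikey index
db.users.createIndex({ tags: 1 });

// a query on a single element can use the index; the explain output
// will show "isMultiKey" : true for the index scan stage
db.users.find({ tags: 'music' }).explain('executionStats');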

Index properties

Besides their type, we can also categorize indexes based on a set of properties they can have (if configured):

  • unique indexes
  • partial indexes
  • sparse indexes
  • ttl indexes

Unique indexes

A unique index guarantees that each value will appear only once in the index. We can create one like this:

db.users.createIndex({ email: 1 }, { unique: true });

If a unique index is in place and we try to insert a document that has the same value of the indexed property as an existing document, MongoDB will raise an E11000 duplicate key error and the insertion will be aborted.

db.users.insertOne({
  name: 'user_1',
  email: 'user_1@email.org',
  age: 21,
  isPremiumUser: false,
  about: 'Random about section',
});
/*
{
  "errmsg" : "E11000 duplicate key error collection...
}
*/

Both single field indexes and compound indexes can be unique. For a compound unique index, individual keys can have the same values, but the combination of all the indexed fields must be unique.
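For instance, a hypothetical compound unique index on name and email would allow two users to share the same name, as long as the (name, email) combination stays unique:

db.users.createIndex({ name: 1, email: 1 }, { unique: true });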

Every collection in MongoDB has one unique index by default. It's on the _id field and is automatically created.
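We can verify that by listing the indexes of the collection:

db.users.getIndexes();
// the first entry in the returned list is the automatically
// created index on the _id field, named "_id_"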

Partial indexes

This type of index is used when we want to index only a subset of the documents in a collection. We can configure which subset that will be with a filter expression.

By indexing a subset of the documents in a collection, partial indexes have lower storage requirements and reduced performance costs for index creation and maintenance.

We can create a partial index like this:

db.users.createIndex(
  {
    age: 1,
  },
  {
    partialFilterExpression: {
      age: { $lte: 21 }, // this is a filter expression
    },
  },
);

A partial index can also be a unique index; the two properties aren't mutually exclusive. A frequent use case for this combination is when we want a field to be unique unless it holds a default value. Some people follow a pattern of having a standard property, let's say an email of type string, with a default value of null for when the property hasn't been set explicitly. If we wanted the email to be unique, but only when it isn't null, we would create the index like this:

db.users.createIndex(
  {
    email: 1,
  },
  {
    partialFilterExpression: {
      email: { $type: 'string' },
    },
  },
);

Sparse indexes

The sparse property of an index ensures that the index will only contain entries for documents that have the indexed field. Documents that don't have that field defined will be skipped.

To create one, we run this command:

db.users.createIndex({ nickname: 1 }, { sparse: true });

TTL indexes

This is a type of index which MongoDB uses to automatically remove documents from a collection after a certain amount of time. When a document reaches the configured timeout, it is automatically deleted.

A common use case is storing user sessions, where the best practice is to automatically delete the session after a short period of time.
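As a sketch of that use case (the sessions collection and its createdAt field are made up for this example), a TTL index is created by passing the expireAfterSeconds option on a date field:

// documents will be removed roughly one hour after their createdAt date;
// a background thread deletes expired documents periodically, so the
// removal is not instantaneous
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });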

Negative sides of indexes

An index can speed up queries dramatically. However, indexes have their cost: every insert, update, or delete must also update all existing indexes, which adds a small overhead to those write operations. This is acceptable most of the time, and indexes still bring major read improvements. But it does mean we should be careful when creating a new index and check whether it is really necessary.

On some rare occasions an index can actually make reads slower, meaning it should be avoided. That can happen if a collection is really small, or if a query has poor selectivity. In that case the hypothetical index would cause an index scan plus a collection scan, as opposed to just a collection scan if the index didn't exist. On those occasions, an index is probably better avoided.
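If we suspect that an index is hurting a particular query, we can compare it against a plain collection scan by hinting $natural, which bypasses all indexes:

db.users
  .find({ isPremiumUser: true })
  .hint({ $natural: 1 }) // force a full collection scan
  .explain('executionStats');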

Bonus - covered queries

The standard way MongoDB uses an index in a query is to quickly find the matching entries in the index and then use those entries as references to fetch the full documents from the queried collection.

However, if we build a compound index with all the fields that we need to read, why do a collection scan at all, right? Right! MongoDB understands this automatically and will skip the collection scan if it figures out that there's no need for it. The way to set this up is to use a projection. With a projection we can tell MongoDB which fields are of interest to us. The other requirement is that all of the fields used in the query filter, as well as the projected fields, need to be present in the index (note that _id must be explicitly excluded, since it isn't part of our index).

Let's check this in an example. In our test collection we have already created the db.users.createIndex({ age: 1, email: 1 }); index. If we do a query like this:

db.users
  .find({ age: { $lte: 30 } }, { age: 1, email: 1, _id: 0 })
  .explain('executionStats');
/*
{
  ...
  "stage" : "PROJECTION_COVERED",
  ...
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 128709,
    "executionTimeMillis" : 147,
    "totalKeysExamined" : 128709,
    "totalDocsExamined" : 0,
    ...
  }
  ...
}
*/

we can see that the stage name is PROJECTION_COVERED. This means that MongoDB executed this query as a covered query. Another sign that this was the case is the totalDocsExamined property, which is set to 0.

Covered queries are a great way to optimize the reading speed and should be utilized whenever possible.

Summary

This blog post listed some of the most important aspects of indexes in MongoDB, although there are a lot of things that we have skipped.

MongoDB is a very flexible and feature-rich database. Knowing how to use indexes optimally is a very important skill for every back-end developer.