# 14 Things I Wish I’d Known When Starting with MongoDB

| Posted by Phil Factor, reviewed by Jonathan Allen on Sep 13, 2018. Estimated reading time: 9 minutes |

## Key Takeaways

• Even though MongoDB doesn’t enforce it, it is vital to design a schema.
• Likewise, indexes have to be designed in conjunction with your schema and access patterns.
• Avoid large objects, and especially large arrays.
• Be careful with MongoDB’s settings, especially those that concern security and durability.
• MongoDB doesn’t have a query optimizer, so you have to be very careful how you order the query operations.

I’ve been a database person for an embarrassing length of time, but I only started working with MongoDB recently. When I was starting out with MongoDB, there were a few things that I wish I’d known about. Anyone with general database experience will bring preconceptions of what databases are and what they do. In the hope of making it easier for other people, here is a list of common mistakes.

## Failing to design a schema

MongoDB doesn’t enforce a schema. This is not the same thing as saying that it doesn’t need one. If you really want to save documents with no consistent schema, you can store them very quickly and easily but retrieval can be the very devil.

The classic article ‘6 Rules of Thumb for MongoDB Schema Design’ is well worth reading, and features like Schema Explorer from third-party tools such as Studio 3T are well worth having for regular schema check-ups.

## Forgetting about collations (sort order)

This can result in more frustration and wasted time than any other misconfiguration. MongoDB defaults to using binary collation. This is helpful to no cultures anywhere. Case-sensitive, accent-sensitive, binary collations were considered curious anachronisms in the eighties along with beads, kaftans and curly moustaches. Now, they are inexcusable. In real life a motorbike is the same as a Motorbike. Britain is the same place as britain. Lower-case (minuscule) is merely a cursive equivalent of an upper-case (majuscule) letter. Don’t get me started about the collation of accented characters (diacritics). When you create the collections in a MongoDB database, use an accent-insensitive, case-insensitive collation appropriate to the languages and culture of the users of the system. This makes searches through string data so much easier.
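To make the difference concrete, here is a minimal sketch in plain Python that needs no server. The `collation_spec` dictionary shows the general shape of a MongoDB collation document (the `en` locale is an assumption; pick the one that suits your users). The `fold` helper is hypothetical and only approximates what a strength-1 collation does, so treat it as an illustration rather than a reimplementation.

```python
import unicodedata

# Shape of a MongoDB collation document. Strength 1 compares base
# characters only, ignoring case and diacritics. The locale is an assumption.
collation_spec = {"locale": "en", "strength": 1}


def fold(text: str) -> str:
    """Hypothetical helper: strip diacritics and case to mimic strength 1."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


names = ["motorbike", "Zebra", "Motorbike", "école", "Ecole"]

# Binary comparison orders by code point, so all upper-case ASCII letters
# come before all lower-case ones, and accented letters come last.
binary_order = sorted(names)

# Insensitive comparison keeps variants of the same word together.
insensitive_order = sorted(names, key=fold)
```

Under the binary ordering, "Motorbike" and "motorbike" end up on opposite sides of "Zebra"; with the folded key they sit next to each other, which is what most users expect.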

## Creating collections with large documents

MongoDB is happy to accommodate large documents of up to 16MB in collections, and GridFS is designed for large documents over 16MB. Just because large documents can be accommodated doesn’t mean that it is a good idea. MongoDB works best if you keep individual documents to a few kilobytes in size, treating them more like rows in a wide SQL table. Large documents will cause several performance problems.

## Creating documents with large arrays

Documents can contain arrays. It is best to keep the number of array elements well below four figures. If the array is added to frequently, it will outgrow the containing document so that its location on disk has to be moved, which in turn means every index must be updated. A lot of index rewriting is going to take place when a document with a large array is re-indexed, because there is a separate index entry for every array element. This re-indexing also happens when such a document is inserted or deleted.

With the MMAPv1 storage engine, MongoDB has a ‘padding factor’ to provide space for documents to grow, in order to minimize this problem.

You might think that you could get around this by not indexing arrays. Unfortunately, without the indexes, you can run into other problems. Because documents are scanned from start to end, it takes longer to find elements towards the end of an array, and most operations dealing with such a document would be slow.
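The cost of indexing an array field can be sketched with a toy model. The `multikey_entries` function below is hypothetical and is not how MongoDB stores its indexes; it only illustrates the point made above that an indexed array contributes one index entry per element, so the amount of index maintenance grows with the array.

```python
def multikey_entries(doc_id, field_values):
    """Toy model: one (value, doc_id) index entry per distinct array element."""
    return sorted((value, doc_id) for value in set(field_values))


small_doc = {"_id": 1, "tags": ["red", "green", "blue"]}
large_doc = {"_id": 2, "tags": [f"tag-{n}" for n in range(5000)]}

small_entries = multikey_entries(small_doc["_id"], small_doc["tags"])
large_entries = multikey_entries(large_doc["_id"], large_doc["tags"])

# Inserting, deleting or re-indexing large_doc touches thousands of entries;
# doing the same for small_doc touches three.
entry_counts = {"small": len(small_entries), "large": len(large_entries)}
```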

## Forgetting that the order of stages in an aggregation matters

In a database system with a query optimizer, the queries that you write are explanations of what you want rather than how to get it. It is like ordering in a restaurant; you usually just order the dish, rather than give detailed instructions to the cook.

In MongoDB, you are instructing the cook. For example, you need to make sure that the data is reduced as early as possible in the pipeline via $match and $project, that sorts happen only once the data is reduced, and that lookups happen in the order you intend. Having a query optimizer that removes unnecessary work, orders the stages optimally, and chooses the type of join can spoil you. MongoDB gives you more control, but at a cost in convenience.
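The effect of stage order can be shown with a small sketch. The pipelines are written as lists of dictionaries, the way a Python driver would accept them, but `run_pipeline` is a hypothetical toy interpreter and not MongoDB: it understands only equality `$match` and `$sort`, and it runs the stages strictly in the order written. The `orders` data is invented for the illustration.

```python
def run_pipeline(docs, pipeline):
    """Toy interpreter (hypothetical): equality $match and $sort only.

    Returns the resulting documents and how many documents each stage received.
    """
    seen = []
    for stage in pipeline:
        seen.append(len(docs))
        (op, spec), = stage.items()
        if op == "$match":
            docs = [d for d in docs
                    if all(d.get(k) == v for k, v in spec.items())]
        elif op == "$sort":
            # Apply sort keys from last to first so the first key dominates.
            for key, direction in reversed(list(spec.items())):
                docs = sorted(docs, key=lambda d: d[key],
                              reverse=direction < 0)
        else:
            raise ValueError(f"unsupported stage: {op}")
    return docs, seen


# Invented sample data: one order in ten is still open.
orders = [
    {"_id": i, "status": "open" if i % 10 == 0 else "closed", "total": 1000 - i}
    for i in range(1000)
]

sort_first = [{"$sort": {"total": 1}}, {"$match": {"status": "open"}}]
match_first = [{"$match": {"status": "open"}}, {"$sort": {"total": 1}}]

slow_result, slow_seen = run_pipeline(orders, sort_first)
fast_result, fast_seen = run_pipeline(orders, match_first)
```

Both pipelines return the same documents, but `slow_seen` shows the sort handling all 1,000 documents while `fast_seen` shows it handling only the 100 that survive the filter.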

Tools like Studio 3T make it simpler to build accurate MongoDB aggregation queries. Its Aggregation Editor feature lets you apply pipeline operators one stage at a time, and you can validate inputs and outputs at each stage for easier debugging.

## Using fast writes

Never set MongoDB for high-speed writes with low durability. This ‘fire-and-forget’ mode makes writes appear to be fast because your command returns before anything has actually been written. If the system crashes before the data is written to disk, the data is lost and the database risks being in an inconsistent state. Fortunately, 64-bit MongoDB has journaling enabled by default.

The MMAPv1 and WiredTiger storage engines both use journaling to prevent this, though WiredTiger can be restored to the last consistent checkpoint during recovery if journaling is switched off.

Journaling will ensure that the database is in a consistent state when it recovers and will save all the data up to the point that the journal is written. The duration between journal writes is configurable using the commitIntervalMs run-time option.

To be confident of your writes, make sure that journaling is enabled (storage.journal.enabled) in the configuration file and the commit interval corresponds with what you can afford to lose.

## Sorting without an index

In searches and aggregations, you will often want to sort your data. Ideally, the sort is done in one of the final stages, after filtering the result, to reduce the amount of data being sorted. Even then, you will need an index that can cover the sort. Either a single or compound index will do this.

When no suitable index is available, MongoDB has to sort in memory. There is a 32MB memory limit on the combined size of all documents in the sort operation, and if MongoDB hits the limit, it will either produce an error or occasionally just return an empty set of records.
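A quick back-of-the-envelope calculation shows how soon the 32MB limit is reached. This is only a sketch: it assumes that a megabyte means 1024 × 1024 bytes and that documents are of a uniform average size, and the function name is made up for the illustration.

```python
SORT_MEMORY_LIMIT_BYTES = 32 * 1024 * 1024  # the 32MB limit quoted above


def max_docs_in_memory_sort(avg_doc_bytes: int) -> int:
    """Rough upper bound on how many documents fit under the sort limit."""
    return SORT_MEMORY_LIMIT_BYTES // avg_doc_bytes


small_docs = max_docs_in_memory_sort(2 * 1024)             # 2KB documents
largest_docs = max_docs_in_memory_sort(16 * 1024 * 1024)   # 16MB documents
```

With 2KB documents the limit allows roughly sixteen thousand documents in an unindexed sort; with documents at the 16MB maximum it allows two.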

## Lookups without supporting indexes

Lookups perform a similar function to a SQL join. To perform well, they require an index on the key value used as the foreign key. This isn’t obvious because the use isn’t reported in explain(). These indexes are in addition to the index recorded by explain() that is used by the $match and $sort pipeline operators when they occur at the beginning of the pipeline. Indexes can now cover any stage of an aggregation pipeline.

## Forgetting to opt in to multi-document updates

The db.collection.update() method is used to modify part or all of an existing document or replace an existing document entirely, depending on the update parameter you provide. It is less obvious that it updates only the first matching document unless you set the multi parameter, which makes it update all documents that match the query criteria.
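The behaviour of the multi parameter can be mimicked with a short in-memory sketch. The `update` function below is a hypothetical stand-in, not a driver call: it works on a list of dictionaries, supports only equality matching, and applies a plain set-these-fields change. The `make_tickets` data is invented.

```python
def update(collection, query, changes, multi=False):
    """Hypothetical in-memory stand-in for an update that sets fields.

    Returns the number of documents modified.
    """
    modified = 0
    for doc in collection:
        if all(doc.get(k) == v for k, v in query.items()):
            doc.update(changes)
            modified += 1
            if not multi:
                break  # default behaviour: stop after the first match
    return modified


def make_tickets():
    return [{"_id": i, "state": "new"} for i in range(5)]


tickets_default = make_tickets()
tickets_multi = make_tickets()

changed_default = update(tickets_default, {"state": "new"}, {"state": "triaged"})
changed_multi = update(tickets_multi, {"state": "new"}, {"state": "triaged"},
                       multi=True)
```

Without `multi`, four of the five tickets are silently left untouched. The Python driver makes the choice explicit by offering separate `update_one` and `update_many` methods.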

## Forgetting the significance of the order of keys in a hash object

In JSON, an object consists of an unordered collection of zero or more name/value pairs, where a name is a string and a value is a string, number, boolean, null, object, or array.

Unfortunately, BSON attaches significance to order when doing searches. The order of keys within embedded objects matters in MongoDB, i.e. { firstname: "Phil", surname: "factor" } does not match { surname: "factor", firstname: "Phil" }. This means that you have to preserve the order of name/value pairs in your documents if you want to be sure to find them.
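Python's `OrderedDict` happens to compare in the same order-sensitive way, so it makes a convenient model of the problem. This is a sketch of the comparison semantics only, with no server involved. The `matches_by_field` helper is hypothetical; it illustrates the usual workaround of querying each embedded field individually with dot notation (for example `name.firstname`, where `name` is an assumed field name), which does not depend on key order.

```python
from collections import OrderedDict

stored = OrderedDict([("firstname", "Phil"), ("surname", "factor")])
same_order = OrderedDict([("firstname", "Phil"), ("surname", "factor")])
swapped = OrderedDict([("surname", "factor"), ("firstname", "Phil")])

# Whole-object comparison is order-sensitive, like matching an embedded
# document against a complete sub-document.
exact_match_same_order = stored == same_order
exact_match_swapped = stored == swapped


def matches_by_field(doc, criteria):
    """Hypothetical helper: compare field by field, ignoring key order."""
    return all(doc.get(key) == value for key, value in criteria.items())


field_match_swapped = matches_by_field(stored, swapped)
```

The whole-object comparison fails as soon as the keys are swapped, while the field-by-field comparison still succeeds.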

## Conclusions

The only way that you could end up feeling disappointed in MongoDB is if you compare it directly with another type of database such as an RDBMS, or come to it with particular expectations. It is like comparing an orange with a fork. Database systems have their purposes. It is best to just understand and appreciate these differences. It would be a shame to pressure the developers of MongoDB down a route that forced them towards an RDBMS way of doing things, and I’d like to continue to see new and interesting ways of solving old problems such as ensuring the integrity of data and making data systems resilient to failure and malice.

MongoDB’s introduction of ACID transactionality in version 4.0 is a good example of introducing important improvements in an innovative way. Multi-document, multi-statement transactions are now atomic, and it is possible to adjust the time allowed to acquire locks, and to expire hung transactions, as well as to change the isolation level.

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has nearly forty years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

### Tell us what you think

Schemas can be enforced via Schema validation using JSON Schema.

The article referenced to point out that there are problems with large documents turned out not to be an issue with large documents. 16MB documents may be a problem, 1MB documents will not be a problem in most cases.

MMAPV1 is deprecated as a storage engine so references to padding factor should be ignored for most users of MongoDB.

Any MongoDB user should also look at our Compass product, which has comparable functionality to Studio 3T, including aggregation support in the latest version.

Remember efficient querying needs indexes.

Use a three-node replica set in development from day one. Get used to understanding how elections and node failures impact application performance.

Know what the graphs look like on a database with no activity, activity and lots of activity.

Run your db with --auth so that you force authentication to be part of the development process rather than a last minute add-on.

Everybody will find life easier if they take their first outing on MongoDB with MongoDB Atlas our database in the cloud which takes away the pain and leaves the database :-)

Finally, don't try and do it all yourself. MongoDB is here to help,

Joe Drumgoole
MongoDB

Some of these comments are fair.

I'm sure there will be some reflexive comments to this post (hell I was going to say something). After reading some of the points, I have to admit some of these tripped me up when I first started using Mongo.

Specifically,
- multi: true
- indices
- large documents

Perhaps these items could be better documented (maybe called out more).

However, the one thing I think is an unforgivable failure of the developer is to believe Schema-less DBs imply that you don't need to model your domain. Just because Mongo will accept any "shape" of a document doesn't mean you should do that for a transactional model. This is particularly true of models with aggregates, particularly those with 1:many and many:many relations. Trying to maintain the lifecycle of those entities when they exist as an item in an array of another entity is not fun.