MongoDB database migration thoughts

TLDR: Database migrations are not as straightforward as they seem. Think about the context and the assumptions of your application before your data grows, starting with downtime, compatibility and consistency.

Intro

Sooner or later nearly every application needs some sort of database migration. Sure, there are applications which don't use a database at all and therefore never need to migrate anything, but most cloud applications running out in the wild use a database to store things. If your application is still under development, the chances are good that things will change and your current data no longer matches the schema you would like to work with. For example, a boolean could become an enumeration, or you need to introduce a completely new field as the data grows. For such scenarios you want your users to be able to easily migrate to the next version without losing their data. If a property was a boolean before and now needs to be an enumeration to represent multiple states, the existing data has to be migrated, e.g. true maps to ready and false maps to in approval.
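
A minimal sketch of such a boolean-to-enumeration migration with the MongoDB Node.js driver; the "tasks" collection, the "status" field and the enum values are just assumptions for illustration:

import { MongoClient } from "mongodb";

async function migrateStatusField(): Promise<void> {
  const client = new MongoClient(process.env.MONGODB_URI!);
  await client.connect();
  const tasks = client.db("app").collection("tasks");

  // Map the old boolean to the new enumeration server-side via an
  // aggregation pipeline update (requires MongoDB 4.2+); only documents
  // that still carry the old boolean schema are touched.
  await tasks.updateMany(
    { status: { $type: "bool" } },
    [{ $set: { status: { $cond: [{ $eq: ["$status", true] }, "ready", "in-approval"] } } }]
  );

  await client.close();
}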

There are many types of applications and of databases. For now, I will focus on a containerized cloud application deployed through IaC with the CDK to AWS Fargate, with MongoDB Atlas as the database, as this maps to my experience. Some thoughts may be transferable to other types of applications, other databases and other deployment strategies. The most important questions are probably your application constraints, like what amount of downtime you can deal with.

Naive migrations

There is a pretty straightforward option to handle database migrations. You could easily run migrations inside a Lambda function or an init-container as part of the CDK stack. The Lambda function is invoked as soon as the stack is deployed, so the migration runs before your container is switched to the new version. With such an architecture the container will only be switched if the database migration was successful; otherwise the stack is rolled back and the old version keeps running. With such a configuration no new container has to deal with old data. I will call this the Alpha-Team-Strategy, as it is the first solution to a not yet existing problem.

This is a naive solution to a complex problem which does not solve all problems properly. The problems that come with such a database migration depend heavily on the context and the assumptions of the application, like whether a downtime due to database migrations is viable. The easiest approach I can imagine is to stop all old containers, do the database migration, and start the new containers. With this approach no one ever writes to or reads from the database while a migration is ongoing, and the application itself never has to deal with incomplete data or mixed schemas. But as a side effect the application suffers a downtime on every database migration, for as long as the migration takes. This might be acceptable if the migration finishes within seconds, but not if it takes multiple hours, e.g. when changing a property name or type for millions of documents.
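
As a rough sketch, such a migration step could be wired into a CDK stack with a Lambda-backed custom resource that must succeed before the Fargate service rolls out; all construct names, the asset path and the SchemaVersion property are assumptions:

import { CustomResource, Duration, Stack, StackProps } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Provider } from "aws-cdk-lib/custom-resources";
import { Construct } from "constructs";

export class MigrationStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Lambda that connects to MongoDB Atlas and applies the migration.
    const migrationFn = new lambda.Function(this, "MigrationFn", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda/migration"), // hypothetical asset path
      timeout: Duration.minutes(15),
    });

    const provider = new Provider(this, "MigrationProvider", {
      onEventHandler: migrationFn,
    });

    // Changing SchemaVersion re-runs the migration on the next deployment.
    // If the Lambda throws, the deployment fails and CloudFormation rolls back.
    const migration = new CustomResource(this, "Migration", {
      serviceToken: provider.serviceToken,
      properties: { SchemaVersion: "2" },
    });

    // fargateService would be the Fargate service construct (not shown); the
    // dependency ensures the new task definition only rolls out after the
    // migration resource has completed successfully.
    // fargateService.node.addDependency(migration);
  }
}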

The Alpha-Team-Strategy works well for some types of migrations, e.g. index migrations: you can create a new index definition, where the build of the index itself happens in the background and not during the deployment, and the index is properly in place for the way the new container will access the data. If some indexes are removed during the migration, there is only a short window in which the old container is still running with potentially slow queries because the index is already gone. But even some index migrations cannot stand on their own without a data migration. Just think about what happens if an index should become a unique index: you probably need some logic to remove eventual duplicates before such a unique index can be built, which either results in blunt data loss or in some sort of migration that combines the duplicates or chooses which data to keep (see the sketch below).

Furthermore, the Alpha-Team-Strategy has more caveats when used for data migration. As the old container is still running while the database migration is ongoing, and the old container does not stop reading and, especially important, writing to the database, there is a chance that the old container writes data in the old schema even after the migration has run, which leaves you with old-schema data in your migrated database, so your migration was without effect. This especially applies to MongoDB, as you can write any document you want, independent of any declared schema. Furthermore, the migrated data should be compatible with the old schema, at least to a level where the old container does not crash while reading new data, as it is still running while some data is already migrated. To think this further: the new container also has to be able to deal with the new and the old schema, as we found out that both of them can be in our database, even after the migration. So your old container and your new container both have to be compatible with the old and the new data schema, which is such a strong binding that you in fact don't have to do any schema migration at all (it would be enough to only migrate your indexes). So if no downtime is viable, the Alpha-Team-Strategy does not help us here. Of course, you could combine the Alpha-Team-Strategy with the Ostrich Algorithm and just pretend there is no problem at all, which works as long as no one accesses the application while the migration is running.
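
To make the unique-index case concrete, here is a sketch of the deduplication that has to run before the index can be built; which document survives is an application decision, and the "users" collection, the "email" field and the keep-the-first rule are assumptions:

import { MongoClient, ObjectId } from "mongodb";

async function dedupeAndIndex(): Promise<void> {
  const client = new MongoClient(process.env.MONGODB_URI!);
  await client.connect();
  const users = client.db("app").collection("users");

  // Group documents by the soon-to-be-unique field and collect duplicates.
  const dupes = await users.aggregate<{ _id: string; ids: ObjectId[] }>([
    { $group: { _id: "$email", ids: { $push: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } },
  ]).toArray();

  for (const group of dupes) {
    const [, ...remove] = group.ids; // naive: keep one document, drop the rest
    await users.deleteMany({ _id: { $in: remove } });
  }

  // Only now can the unique index be built without failing on duplicates.
  await users.createIndex({ email: 1 }, { unique: true });
  await client.close();
}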

|---- Container A Running ----|
                 |- Migration -|
                              |---- Container B Running ----|

As the migration runs while the old container is still running, you cannot guarantee that no old-schema data remains after the migration has finished.

|---- Container A Running ----|
                              |- Migration -|
                              |- Downtime --|
                                            |---- Container B Running ----|

The application would need to deal with a downtime for as long as the migration is running.

Complex migrations

There is no way around the fact that either the old container has to be compatible with the new data schema, or the new container has to be compatible with the old data schema; otherwise a migration without downtime is impossible. One of these two options is better than the other. If your old container has to be compatible with your new data schema, this heavily restricts how you can migrate your schema; e.g. it is impossible to replace an object with an array, as the old container tries to parse the object and cannot handle the array. If your new container has to handle the old and the new schema, you can build custom logic that reads the old schema from the database and converts it to the object you need, e.g. read the object and wrap it in an array with a single element.

You would need some sort of schema versioning and an attribute to know which document is in which schema. With that in place you can build new containers which are able to read the old schema and the new schema, but will only write in the form of the new schema. With this approach you can easily add one of the new containers while still running an old container next to it. So as a first step you replace all old containers with new containers which can all read the old and the new data schema and will only write the new data schema. At this point the first half of your migration is done, as frequently written documents will migrate by themselves. But a document which is never changed will never migrate, so now you can run some sort of aggregation pipeline in the background which converts all remaining documents from the old schema to the new schema. This is fully compatible and leaves the collection in a consistent state with only the new schema, as at this time only the new schema is being written.
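
As a sketch, the read path of the new container could look like this; the document shape, the version numbers and the object-to-array change are assumptions for illustration:

// Old schema stored a single author object, the new schema stores an array.
type DocV1 = { schemaVersion: 1; author: { name: string } };
type DocV2 = { schemaVersion: 2; authors: { name: string }[] };

// Normalize anything read from the database to the new schema. Writes always
// use DocV2, so frequently touched documents migrate themselves over time.
function normalize(doc: DocV1 | DocV2): DocV2 {
  if (doc.schemaVersion === 1) {
    return { schemaVersion: 2, authors: [doc.author] }; // wrap the object in an array
  }
  return doc;
}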

Even this approach is not flawless: if your new schema is not readable by the old container, some writes in the new data schema can already happen while an old container is still running, which can result in errors. But first, there is always another container next to it which will serve successful reads, and second, this can only happen for as long as Fargate needs to replace all containers. If this is not acceptable, you would need to write some logic to exclude documents whose schema version is not known by the old container, which could possibly result in some missing data during the time Fargate switches the containers. An extra thought: you probably don't need to keep all schema versions within the application, like you know from some migration tools out there which can migrate from the very first version of your schema to the latest version by applying all the steps described by the migrations in order. Instead, your application only needs to support two schemas, the last one and the current one. With this setup you can migrate all your data by running the different versions of your application one after another on the same database.
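
The final background migration then only has to rewrite the documents that are still on the old version; a sketch, with the same assumed field names as above:

import { Collection } from "mongodb";

// docs is an initialized Collection handle, as in the earlier sketches.
async function finishMigration(docs: Collection): Promise<void> {
  // At this point only schemaVersion 2 is being written by the application,
  // so afterwards the collection is consistently on the new schema.
  await docs.updateMany(
    { schemaVersion: 1 },
    [
      { $set: { schemaVersion: 2, authors: ["$author"] } }, // wrap the old object in an array
      { $unset: "author" },                                 // drop the old field
    ]
  );
}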

|---- Container A Running ----|
                            |---- Container B Running ----|
                              |- Migration -|

Migrate the data after all old containers are stopped.

These are just some thoughts about a complex problem which scales with your data. When starting development and testing your application with only hundreds of documents, the Alpha-Team-Strategy just works, and no one will ever notice a downtime of a few seconds. But as soon as you run your workload in production with terabytes of data, your downtime will grow and the problems with it. We all know that state management is hard, so it is good practice to think about your schema states before problems arise, as this is probably also hard to test.