Key Takeaways
- The database schema, including indexes, needs to be in source control.
- Data that controls business logic, such as lookup tables, also needs to be in source control.
- Developers need a way to easily create local databases.
- Shared databases should only be updated via the build server.
A robust DevOps environment requires continuous integration for every component of the system. But far too often, the database is omitted from the equation, leading to problems ranging from fragile production releases and inefficient development practices to simply making it harder to onboard new programmers.
In this article, we discuss the unique aspects of databases, both relational and NoSQL, in a successful continuous integration environment.
Source Control for Schema
The first issue to address is source control for schemas. It is not appropriate for developers to apply database changes in an ad hoc manner. That would be the equivalent of editing JavaScript files directly on the production server.
When planning your source control for the database, make sure you capture everything. This includes, but is not limited to:
- Tables or Collections
- Constraints
- Indexes
- Views
- Stored Procedures, Functions, and Triggers
- Database configuration
You may be thinking, “I’m using a schemaless database, so I don’t need source control.” But even then, you are going to need to account for indexes and overall database configuration settings. If your indexing scheme differs between your QA and production databases, it will be very difficult to perform meaningful performance tests.
There are two basic types of source control for databases, which we’ll call “whole-schema” and “change script”.
Whole-Schema Source Control
“Whole-schema” source control is where your source control looks like the way you want your database to look. When using this pattern, you can see all the tables and views laid out exactly as they are meant to be, which can make it easier to understand the database without having to deploy it.
An example of whole-schema source control for SQL Server is SQL Server Data Tools (SSDT). In this tool, all of your database objects are expressed as CREATE scripts. This can be convenient when you want to prototype a new object using SQL, then paste the final draft directly into the database project under source control.
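As a minimal sketch, a table in an SSDT database project might look like the following; the table and column names here are hypothetical:

```sql
-- One file per object, each written as a CREATE script describing the
-- desired end state. SSDT diffs this against the target database.
CREATE TABLE dbo.Customer
(
    CustomerId int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    FullName nvarchar(100) NOT NULL,
    CreatedAt datetime2 NOT NULL
        CONSTRAINT DF_Customer_CreatedAt DEFAULT (SYSUTCDATETIME())
);
```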
Another example of whole-schema source control is Entity Framework Migrations. Instead of SQL scripts, the database is primarily represented by C#/VB classes. But again, you can get a good picture of the database by browsing the source code.
When working with whole-schema source control, you usually don’t write your migration scripts directly. The deployment tools figure out what changes are needed for you by comparing the current state of the database with the idealized version in source control. This allows you to rapidly make changes to the database and see the results. When using this type of tool, I rarely alter the database directly and instead allow the tooling to do most of the work.
Occasionally the tooling isn’t enough, even with pre- and post-deployment scripts. In those cases, the generated migration script has to be hand-modified by a database developer or DBA, which can break your continuous deployment scheme. This usually happens when there are major changes to a table’s structure, as the generated migration script can be inefficient in those cases.
Another advantage of whole-schema source control is that it supports code analysis. For example, if you rename a column but forget to change it in a view, SSDT will return a compile error. Just like static typing in application programming languages, this catches a lot of errors and prevents you from attempting to deploy clearly broken scripts.
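To illustrate, here is a hypothetical pair of objects where the column was renamed in the table but not in the view; SSDT will refuse to build this project, reporting an unresolved reference in the view:

```sql
-- The column was renamed from Name to FullName...
CREATE TABLE dbo.Customer
(
    CustomerId int NOT NULL PRIMARY KEY,
    FullName nvarchar(100) NOT NULL
);
GO
-- ...but this view still references the old name. SSDT reports an
-- unresolved reference at build time instead of deploying a broken view.
CREATE VIEW dbo.CustomerList
AS
SELECT CustomerId, Name FROM dbo.Customer;
```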
Change Script Source Control
The other option is “change script” source control. Instead of storing the database objects themselves, you store the list of steps necessary to create them. The one I’ve used successfully in the past is Liquibase, but the various tools all work in much the same way.
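As a sketch of the style, here is what a changelog might look like using Liquibase’s formatted-SQL syntax (the table and author names are hypothetical):

```sql
--liquibase formatted sql

--changeset jallen:1
CREATE TABLE Customer
(
    CustomerId int NOT NULL PRIMARY KEY,
    FullName nvarchar(100) NOT NULL
);

--changeset jallen:2
-- Each changeset is applied exactly once, in order, and is recorded in a
-- tracking table so already-deployed steps are skipped on the next run.
ALTER TABLE Customer ADD Email nvarchar(255) NULL;
```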
The main advantage of a change script tool is that you have full control over what the final migration script will look like. This makes it easier to perform complex changes, such as splitting or combining tables.
Unfortunately, there are several disadvantages to this style of source control. The first is the need to write the change scripts themselves. While this does give you more control, it is also very time consuming. It is much easier to add a property to a C# class than to write out the ALTER TABLE script in longhand.
This becomes even more problematic when working with features like SchemaBinding. For those who are unfamiliar with the term, SchemaBinding enables some advanced features in SQL Server by locking down the table schema. If you make changes to a table or view’s design, you first have to remove the SchemaBinding from any views touching that table or view. And if those views are referenced by other schema-bound views, they too need to be temporarily unbound. All of this boilerplate, while easy for a whole-schema source control tool to generate, is hard to do correctly by hand.
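To give a sense of that boilerplate, here is a sketch of what widening a single column under one schema-bound view requires (hypothetical object names; a real script would also need to restore any indexes defined on the view):

```sql
-- Step 1: unbind by dropping the schema-bound view.
DROP VIEW IF EXISTS dbo.ActiveCustomers;
GO
-- Step 2: now the underlying table can be altered.
ALTER TABLE dbo.Customer ALTER COLUMN FullName nvarchar(200) NOT NULL;
GO
-- Step 3: recreate the view exactly as it was.
CREATE VIEW dbo.ActiveCustomers
WITH SCHEMABINDING
AS
SELECT CustomerId, FullName
FROM dbo.Customer
WHERE IsActive = 1;
GO
```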
Another disadvantage of change script tools is that they tend to be very opaque. For example, if you want to know the columns on a table without deploying it, you have to read every change script that touches that table. This makes it really easy to miss something.
Which source control model to use?
When starting a new database project from scratch, I always choose whole-schema source control when it is available. It allows developers to work much faster and requires less specialized knowledge to use successfully, so long as you have one person who can set it up.
For existing databases, especially ones that have been in production for a number of years, change script source control is often more appropriate. Whole-schema tools make certain assumptions about the database design, while change script tools work with any database. Whole-schema tools are also harder to build, so one may simply not be available for your particular database.
Data Management
Tables can be broadly classified as “managed”, “user”, or “hybrid” depending on the nature of the data they contain. The way you treat a table differs depending on which category it falls into.
Managed Tables
A common mistake when putting databases under source control is forgetting the data. Invariably there are going to be “lookup tables” that hold data users are not meant to modify. For example, these could hold table-driven logic for business rules, the various status codes for a state machine, or just a list of key-value pairs that match an enum in the application code.
The data in this type of table should be treated as source code. Changes to it need to go through the same review and QA process you would use for any other code change. And it is especially important that these changes are deployed automatically along with other application and database deployments; otherwise they may be accidentally forgotten.
In SQL Server Data Tools, I handle this using a post-deployment script. In this script, I populate a temporary table with what the data should look like, then use a MERGE statement to update the real table.
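A minimal sketch of such a post-deployment script, using a hypothetical OrderStatus lookup table:

```sql
-- Stage the desired contents in a temporary table...
CREATE TABLE #OrderStatus (OrderStatusId int NOT NULL, Name varchar(50) NOT NULL);

INSERT INTO #OrderStatus (OrderStatusId, Name)
VALUES (1, 'Pending'), (2, 'Shipped'), (3, 'Cancelled');

-- ...then make the real table match it exactly.
MERGE INTO dbo.OrderStatus AS target
USING #OrderStatus AS source
ON target.OrderStatusId = source.OrderStatusId
WHEN MATCHED AND target.Name <> source.Name THEN
    UPDATE SET Name = source.Name
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderStatusId, Name) VALUES (source.OrderStatusId, source.Name)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

DROP TABLE #OrderStatus;
```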
If your source control tool doesn’t handle this well, you could also build a stand-alone tool to perform the data updates. What’s important isn’t how you do it, but whether or not the process is easy to use and reliable.
User Tables
User tables are any table where a user can add or modify data. For our purposes, this includes both tables they can modify directly (e.g. name & address) and tables modified indirectly by their actions (e.g. shipping receipts, logs).
Real user data is almost never directly added to source control. However, it is a good practice to have realistic-looking sample data available for development and testing. This could live directly in the database project, elsewhere in source control, or, if particularly large, on a separate file share.
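If the sample data lives in the database project, one approach is to gate it behind a sqlcmd variable in the post-deployment script so it never reaches production. This is a sketch; the variable name and data are hypothetical:

```sql
-- $(Environment) is a sqlcmd variable defined per publish profile.
IF '$(Environment)' <> 'Production'
BEGIN
    INSERT INTO dbo.Customer (FullName)
    VALUES ('Sample User One'), ('Sample User Two');
END
```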
Hybrid Tables
A hybrid table is one that stores both managed and user data. There are two ways this can be partitioned.
A column partition is when users can modify some columns but not others. In this scenario, you treat the table just like a normal managed table, with one additional restriction: never update a user-controlled column.
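For example, consider a hypothetical dbo.Report table where Name and Definition are managed but IsFavorite belongs to the user. The deployment script simply never mentions the user-controlled column:

```sql
MERGE INTO dbo.Report AS target
USING (VALUES
    (1, 'Daily Sales', '...report definition...')
) AS source (ReportId, Name, Definition)
ON target.ReportId = source.ReportId
WHEN MATCHED THEN
    -- Only the managed columns are updated; IsFavorite is never touched.
    UPDATE SET Name = source.Name, Definition = source.Definition
WHEN NOT MATCHED BY TARGET THEN
    -- On insert, IsFavorite is left to its column default.
    INSERT (ReportId, Name, Definition)
    VALUES (source.ReportId, source.Name, source.Definition);
```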
A row partition is when there are certain records users cannot modify. A common scenario I run into is the need for hard-coded values in the Users table. On larger systems, I may have a separate user ID for each micro-service so it can make changes independently from any real person. For example, a user may be called “Bank Data Importer”.
The best way I have found for managing row-partitioned hybrid tables is via reserved keys. When I define the identity/auto-number column, I’ll set the initial value at 1,000, leaving IDs 1 through 999 to be managed through source control. This requires a database that allows you to manually set a value in an identity column. In SQL Server, this is done via the SET IDENTITY_INSERT command.
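Here is a sketch of both halves, using a hypothetical Users table: the identity seed starts at 1,000, and a post-deployment script inserts the reserved rows with explicit IDs:

```sql
-- User-created rows start at 1,000; IDs 1-999 are reserved for source control.
CREATE TABLE dbo.Users
(
    UserId int NOT NULL IDENTITY(1000,1) PRIMARY KEY,
    UserName nvarchar(100) NOT NULL
);
GO
-- Post-deployment: insert the reserved rows with explicit IDs.
SET IDENTITY_INSERT dbo.Users ON;

INSERT INTO dbo.Users (UserId, UserName)
SELECT v.UserId, v.UserName
FROM (VALUES
    (1, N'Bank Data Importer'),
    (2, N'Nightly Reconciliation Job')
) AS v (UserId, UserName)
WHERE NOT EXISTS (SELECT 1 FROM dbo.Users u WHERE u.UserId = v.UserId);

SET IDENTITY_INSERT dbo.Users OFF;
```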
Another option for dealing with this scenario is to have a column called “SystemControlled” or something to that effect. When set to 1/true, it means the application will not allow direct modifications. When set to 0/false, the deployment scripts know to ignore the row.
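Under that scheme, a deployment script might look something like this sketch (hypothetical table; the key point is that every statement filters on the flag):

```sql
-- Refresh a managed row; user-created rows (SystemControlled = 0)
-- are never touched by the deployment.
UPDATE dbo.EmailTemplate
SET Subject = 'Your order has shipped', Body = '...'
WHERE TemplateId = 1 AND SystemControlled = 1;

-- Remove managed rows that are no longer in source control.
DELETE FROM dbo.EmailTemplate
WHERE SystemControlled = 1
  AND TemplateId NOT IN (1, 2, 3); -- the IDs source control knows about
```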
Individual Developer Databases
Once you have the schema and data under source control, you can take the next step and start setting up individual developer databases. Just as each developer should be able to run their own instance of a web server, each developer who may be modifying the database design needs to be able to run their own copy of the database.
This rule is frequently violated, to the detriment of the development team. Invariably, developers making changes to a shared environment are going to interfere with each other. They may even get into “deployment duels”, where two developers each try to deploy changes to the database. When this happens with a whole-schema tool, they will alternately revert each other’s changes without even realizing it. If using a change-script tool, the database may be placed in an indeterminate state where the migration scripts no longer work and it needs to be restored from backups.
Another issue is schema drift. This is when the development database no longer matches what’s in source control. Over time, developer databases tend to accumulate non-production tables, test scripts, temporary views, and other cruft that needs to be cleared out. This is much easier to do when each developer has their own database they can reset whenever they want.
The final and most important issue is that service and UI developers need a stable platform to write their code against. If the shared database is constantly in flux, they can’t work efficiently. At one company I worked at, it was not unusual to hear developers shout, “services are down again” and then spend an hour playing video games while the shared database was put back together.
Shared Developer and Integration Databases
The number one rule for a shared developer or integration database is that no one should be modifying it directly. The only way a shared database should be updated is via the build server and the continuous integration/deployment process. Not only does this prevent schema drift, it allows updates to be scheduled to reduce disruptions.
As a rule of thumb, I allow the shared developer database to be updated whenever a check-in is made to the relevant branch. This can be somewhat disruptive, but usually it is manageable. And you do need a place to verify the deployment scripts before they hit integration.
For my integration databases, I tend to schedule deployments to happen once per day, along with any services. This provides a stable platform that UI developers can work against. Few things are as frustrating to UI developers as not knowing whether code that suddenly starts failing is their mistake or a problem in the services/database.
Database Security and Source Control
An often-overlooked aspect of database management is security; specifically, which users and roles have access to which tables, views, and stored procedures. In the all-too-familiar worst-case scenario, the application gets full access to the database. It can read and write every column of every table, even ones it has no business touching. Data breaches often result from the exploitation of flaws in a minor utility application that has far more access rights than it needs.
The main objection to locking down the database is, “we don’t know what will break”. Because the database has never been locked down before, developers have no idea what permissions the application actually needs.
The solution is to put the permissions in source control from day one. That way, when a developer tests the application, it will fail from the outset if the permissions are incorrect. This in turn means that by the time it gets to QA, all the permissions are settled and there is no guesswork or risk of a missing permission.
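In SQL Server, those permissions are themselves just scripts, so they can live in the same database project as everything else. A minimal sketch, with hypothetical role and table names:

```sql
-- The application connects as a member of this role rather than as dbo.
CREATE ROLE OrderService;

-- Grant only what the service actually needs...
GRANT SELECT, INSERT ON dbo.Orders TO OrderService;
GRANT SELECT ON dbo.OrderStatus TO OrderService;

-- ...and nothing else. A missing permission now fails in development,
-- not in production.
```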
Containerization
Depending on the nature of your project, containerization of the database is an optional step. To illustrate why, I’ll offer two use cases.
For our first use case, the project has a very simple branching structure: there is a “dev” branch that feeds into a QA branch, which in turn feeds into Staging and finally Production. This can be handled by four shared databases, one for each stage in the pipeline.
In the second use case, we have a set of major feature branches, each further subdivided into its own dev and QA branches. Features have to pass QA before becoming candidates for merging into the main branch, so each major feature needs its own test environment.
In the first use case, containerization is probably a waste of time, even if your web services do need containers. For the second use case, containerization of some sort is critical. If not real containers (e.g. Docker), then at least deployment scripts that can generate new environments on demand (e.g. in AWS or Azure) when creating new major feature branches.
About the Author
Jonathan Allen got his start working on MIS projects for a health clinic in the late ’90s, bringing them up from Access and Excel to an enterprise solution by degrees. After spending five years writing automated trading systems for the financial sector, he became a consultant on a variety of projects, including the UI for a robotic warehouse, the middle tier for cancer research software, and the big data needs of a major real estate insurance company. In his free time he enjoys studying and writing about martial arts from the 16th century.