InfoQ

News

Continuous Integration And Version Control for Databases

Posted by Mike Bria on Feb 18, 2008 05:01 AM

Community
Architecture,
Agile
Topics
Agile Techniques ,
Artifacts & Tools ,
Modeling
Tags
Relational Databases ,
Refactoring ,
MySQL ,
Releases ,
XP ,
ORM ,
Continuous Integration
After a post asserting that one must, as a rule, always version their database work, Scott Allen detailed an approach to making the best of versioning databases. Allen presented a comprehensive, practical approach to creating the baseline, using change scripts to manage schematic revisions, controlling programmatic database objects (like views, stored procedures, functions, and triggers), and leveraging branching and merging.

Allen began the series after a post where he gives what his experience has shown as the three rules for development with relational databases. The rules are:

1. Never use a shared database server for development work.
Like many conveniences in software development, a shared database is a tar pit waiting to fossilize a project. Developers overwrite each other's changes. The changes I make on the server break the code on your development machine. Remote development is slow and difficult. Avoid using a shared database at all costs, as they ultimately waste time and help produce bugs.
2. Always Have a Single, Authoritative Source For Your Schema
Everyone should know where the official schema resides, and have a frictionless experience in getting a fresh database setup. I should be able to walk up to a computer, get the latest from source control, build, and run a simple tool to setup the database (in many scenarios, the build process can even setup a database if none exists, so the process is one step shorter).
3. Always Version Your Database
the common goal is to propagate changes from development, to test, and ultimately to production in a controlled and consistent manner. A second goal is to have the ability to recreate a database at any point in time. This second goal is particularly important if you are shipping software to clients. If someone finds a bug in build 20070612.1 of your application, you must be able to recreate the application as it appeared in that build - database and all.
Allen states the goal of versioning databases as being able to push out changes in a consistent, controlled, testable, and reproducible manner. Many coaches would agree that achieving this goal is paramount for an agile team to be effective.

Having established the importance of versioning databases, Allen continues with a series of 4 posts describing his preferred approach to achieving it.

The first of these describes what Allen asserts as the starting point for versioning a database - generating a baseline schema. Ultimately, the baseline is a script or collection of scripts containing all the SQL commands required to generate the application's database from scratch. This would include SQL commands to create every object (table, constraint, function, view, index, etc), commands to populate lookup tables with static data, and commands to insert any bootstrap data needed by the application. Once created and proven to be accurate, Allen says to "commit the file(s) to source control" and "consider your schema baselined!"

To create this baseline, Allen advises use of a tool that can retrospectively create the scripts from an existing database (as opposed to writing them each by hand), and describes his preference as to the structure of the resulting script files:
I like to keep all of the SQL needed to create tables, constraints, defaults, and primary indexes in a single file. Any views, stored procedures, and functions are scripted one per file.

If you go the multiple file approach, make sure to write a batch file, shell script, application, or some other form of automation that can automatically locate and run all of the script files required to install the database. Human intervention in this process is a step backwards.
Allen stresses the suggestion to include in this baseline a table that will record any future changes to the schema, then procedes in his next three posts to describe the how these changes should occur.

First, Allen discusses the mechanism he uses to manage changes to schema objects other than views, stored procedures, and functions - change scripts.  This approach specifies that each change (or set of related changes) be represented as an "incremental" update in a newly created script file, similar to that of Ruby Migrations. In a nutshell, as a team discovers the database requires a change, they create and test a new script that alters the database in the desired fashion (including any necessary data migration) and commit it to source control. Once published, the script is never to be changed again.

Allen's approach then for making updates to views, stored procedures, and functions is the complete opposite as to other database objects - one "create command" file per object, then update this file as updates to the object are required. As to why he prefers this, Allen states:
The simple reason is to find problems as early as possible. If someone commits a schema change and the change removes a column used by a view, you'll find out there is an error early – hopefully before the build escapes for testing. Likewise, if someone checks in a view but forgets to publish a schema change the view needs, someone else is going to show up at their desk a few minutes later asking why they are breaking the software.

A second reason is to avoid some rare errors I've seen. Some databases have a tendency to screw up execution plans when the schema changes underneath a view. Dropping everything and starting over avoids this problem, which is really hard to track down.
Allen strongly stresses the importance of using an automated tool to make optimal use of the strategies listed above:
The magic happens when a developer, tester, or installer updates from source control and runs a tool that updates their local database. The tool uses a three step process:

1. The tool applies new schema changes by comparing the available schema change files to the SchemaChangeLog records in the database.
2. The tool will DROP all stored procedures, views, and functions in the database.
3. The tool will run all of the scripts needed to add the views, stored procedures, and functions back into the database.
As for the benefits of using this strategy, particularly the automated tool, Allen gives these examples:
Since the schema change scripts are in source control, you can recreate your database as it looked at any point in time. Is a customer reporting a bug on build 3.1.5.6723? Pull the source code tagged or labeled with that version number and run the baseline, then all schema change scripts included in the tag. You now have the same database and have a much better chance to recreate the bug. Also, changes move from development to test, and ultimately into production in a consistent, orderly, and reproducible manner.
Allen completes his series with a discussion about how the approach he describes handles branching and merging, as both are likely realities of any application that lives past its first version. Allen suggests Branch For Release as his preferred branching strategy, and explains that he re-baselines the database with each new release. He presents an example describing this, and adds a scenario in which a defect is found in an older, branched release that requires a schema change. On the branched release, a new script for the change is created with no problem. The question then posed is how to reflect the change on the current release, now residing on the mainline:
To get this fix into the mainline, there are two options. Well, actually there are an infinite number of options to consider depending on how you apply your updates, but here are two options:

1. Merge the schema change script into the mainline as 01.00.0046, and fix the 2.0 baseline script to incorporate this change.
2. Write a new schema change script, 02.00.0003, that has the same changes as change 46 in the branch.

With option #1 you have to be careful because any database that updated to v2.0 will not take the 46th change script from the branch (unless you write your tools differently than I do). You have to force people to run this script manually, or you go around destroying any existing v2.0 databases (which at this point. should only be on development and test machines anyway). This is not a great option, but if you are not deep into 2.0 it is sometimes viable.

Option #2 is a bit friendlier. The v1.0 databases will pick up the fix from 01.00.0046. The v2.0 databases will pick up the fix from 02.00.0003. You have to be careful though, to write the 02.00.0003 change script so that it won't try to reapply changes if the 01.00.0046 script ran.

In other words, databases installed from the v2.0 baseline script need to apply the 02.00.0003 script, but production type databases that have been around since 1.0 will use the 01.00.0046 script, and you don't want 02.00.0003 to create an error by making changes that are already in place when the database eventually updated to v2.0.
For more on this and related subjects, see Scott Ambler's take on agile database development.

Related Sponsor

VersionOne is recognized by Agile practitioners as the leader in Agile project management tools. Companies such as Adobe, BBC, CNN, Dow, HP, IBM, Sony and 3M have turned to VersionOne to help deliver greater value to their customers.
dbmigrate by Sam Pullara Posted Feb 18, 2008 2:05 PM
Re: dbmigrate by Alex Popescu Posted Feb 18, 2008 6:12 PM
Can't stress the need for automated tooling enough by Damon Edwards Posted Feb 18, 2008 3:34 PM
migration tools by Francois Ward Posted Feb 18, 2008 8:04 PM
  1. Back to top

    dbmigrate

    Feb 18, 2008 2:05 PM by Sam Pullara

    Most of this is implemented for Java and Grails with this project: http://code.google.com/p/dbmigrate Sam

  2. Back to top

    Can't stress the need for automated tooling enough

    Feb 18, 2008 3:34 PM by Damon Edwards

    Hi Mike, Thanks for the great job summarizing these posts. As you point out, Scott stresses the need for automated tooling for database "builds" and deployment. You'll probably also want to consider also using tool like ControlTier to tie together both your code deployments and your data deployments. Having one button to push to get both done is about the only way you can ensure your developers and QA staff get it right each and every time. ControlTier: http://open.controltier.com ControlTier's automation library for migrating and deploying database updates (similar to dbmigrate for Rails): http://moduleforge.controltier.com/lib_rdb.html

  3. Back to top

    Re: dbmigrate

    Feb 18, 2008 6:12 PM by Alex Popescu

    Sam, I am wondering how complete do you consider the project. I have quickly checked it a couple of weeks back, and all I could find was some code with no documentation, no clear examples, etc. So, I had to move along and check the LiquiBase. ./alex -- .w( the_mindstorm )p.

  4. Back to top

    migration tools

    Feb 18, 2008 8:04 PM by Francois Ward

    Tools like ActiveRecord:Migration do the versioning pretty well. Here though, we're a .NET shop, and while there's some projects on their way to replicate it, most are in beta, or didn't do what we need... Just whipped out one day of work and made our own based on SQL Server Management Objects (we use SQL Server exclusively) with a bit of Reflection and Attribute magics, and we were going strong. Now we can get our database to any point in time easily. Working on local, unshared database instances for development is also definately a must. The only catch is having a comprehensive set of test data that follows a given schema.

Educational Content

Bindings, Platforms, and Innovation

This presentation focuses on the Internet and separating myth from fact, history from the future, and the mundane from the imaginative. Bob Frankston presents a vision of what could and should be.

Orchestrating Long Running Activities with JBoss / JBPM

This article explores the use of JBoss and jBPM to implement design solutions that effectively address the issue of orchestrating long running activities.

Neo4j - The Benefits of Graph Databases

This presentation covers the use of graph databases as an optimal solution for data that is difficult to fit in static tables, rapidly evolving data or data that has a lot of optional attributes.

Realistic about Risk: Software development with Real Options

This session introduces Real Options and shows how it can help in running your project. Real Options is a decision-making process that can be used to manage risk.

Communication Flexibility Using Bindings

This article discusses the use of bindings on services and references (including the instance of non-configured bindings) as the means to implement SCA communications in a Web and SOA environment.

Writing DSLs in Groovy

After a short introduction to DSLs, Scott Davis plays with the keyboard showing how to approach the creation of a DSL by typing working snippets of Groovy code that get executed.

Scaling Agile with C/ALM (Collaborative Application Lifecycle Management)

IBM Rational and InfoQ present, Scaling Agile with C/ALM, an eBook showing organizations how to become “finely tuned software delivery machines” by enabling team integration and scaling.

Concurrent Programming with Microsoft F#

Amanda Laucher presents a real life enterprise application written in F#. She shows actual code snippets, explaining design decisions and suggesting how to use some of the F# constructs.