
Introducing Obevo: Get Your Database SDLC under Control


Key Takeaways

  • Obevo is an enterprise-grade database deployment tool developed at Goldman Sachs and released in 2017 as an open source project under the Apache 2.0 License.
  • Obevo allows database scripts to be organized per object, similar to application code, providing many advantages to developers.
  • Obevo can help both new applications and systems with existing databases get their DB change management under SDLC control.
  • Teams can get started quickly with Obevo’s onboarding tooling and guided examples.
  • Additional features include rollback support, in-memory database testing, and phased deployments. 

In recent years, Goldman Sachs adopted a standard software development lifecycle (SDLC) for application build and deployment. This included managing database schema definitions in source control for both new and existing systems, a more difficult endeavor than managing application code. In this article, we will describe how Obevo, our recently open-sourced DB Deployment utility, helped many of our enterprise applications get their databases under SDLC control.

The Database Deployment Problem Space at an Enterprise

Onboarding database definitions to a standard SDLC process is challenging for many reasons, notably the stateful nature of databases and the need to perform incremental migrations. Hence, many applications have not historically had an automated or smooth database deployment process. Our goal was to get database schemas managed under the same SDLC that applications go through: with all definitions committed to a version control system (VCS) and deployed via a standard build/release mechanism.

This effort was complicated by the diversity of use cases in our database estate:

  1. The modern: brand new schemas deploying tables, doing in-memory testing, and incorporating proper SDLC from the start
  2. The old: systems over ten years old that never had a controlled deployment process
  3. The complex: hundreds or thousands of objects of many types, such as tables, views, stored procedures, functions, static data uploads, and data migrations
  4. The time-consuming: tables with millions of data rows, and deployments taking hours

Regardless of the use case, the SDLC process itself is complicated by the large number of geographically distributed developers actively making changes.

While existing open source tools could do the job for simpler cases, they could not handle the scale and complexity of some of our existing systems. And we could not just leave these existing systems without a proper SDLC; they are critical systems with active development and releases.

Thus, we developed Obevo (available under the Apache 2.0 License), a tool to handle all such use cases. Obevo’s key differentiator is the ability to maintain DB objects per file (similar to how class definitions are typically stored per file) while still handling incremental deployments.

In this article, we will describe the DB Deployment problem space and then demonstrate how the object-based project structure helps us elegantly manage hundreds and thousands of schema objects for a variety of object and environment types.

Quick Obevo Feature List: Something for Everyone!

For New Systems:

  • Simple to maintain, review, and deploy
  • Selectively deploy tables for easier usage in tests
  • In-memory test database conversion of most table scripts

For Existing and Complex Systems:

  • Easy to reverse-engineer and onboard
  • Stateful (tables) and stateless (views, SPs, functions, …) objects supported
  • Easy to support hundreds or thousands of objects

DB Object Types (Stateful vs. Stateless)

First, let’s review the deployment semantics for different DB object types, as this impacts the tooling design.

Quick terminology note for this article:

  • The act of applying some code/SQL to modify a database will be called a deployment or migration
  • The units of code that get deployed will be referred to as scriptlets
  • A file can potentially contain multiple scriptlets, i.e. a scriptlet is not equivalent to a script file. Whether a file should contain one scriptlet or multiple is a topic within this article.

Stateful objects (e.g. tables)

A stateful object requires incremental modifications to its definition, instead of a full definition replacement. Take the following example of adding two columns to MyTable.

Ideally, we could bring the database to the end state by specifying a SQL statement that defines a table with all four columns. Alas, SQL does not provide a workable solution for this:

  1. Dropping and recreating the table means losing your data
  2. Saving the data in a temporary table could be an expensive and complex operation

Instead, relational database management systems (RDBMS) allow users to specify ALTER statements to modify an existing table.
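For the MyTable example above, the end state would be reached with an initial CREATE followed by incremental ALTERs. A sketch (the column names are illustrative, and the exact ALTER syntax varies by DBMS):

```sql
-- Initial scriptlet: creates MyTable with its first two columns
CREATE TABLE MyTable (
    ID   INT NOT NULL,
    NAME VARCHAR(100)
);

-- Later scriptlets: incrementally add the two new columns
ALTER TABLE MyTable ADD COLUMN CREATED_AT TIMESTAMP;
ALTER TABLE MyTable ADD COLUMN UPDATED_AT TIMESTAMP;
```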

Some column updates may require data migrations, such as moving column data from one table to another.

Thus, each object is a result of applying multiple scriptlets; the initial scriptlet will create the object, and subsequent ones will modify it.

Stateless objects (e.g. views, stored procedures)

A stateless object, on the other hand, can be created and modified by specifying its full object definition.

  • “DROP + CREATE” statements or a “CREATE OR REPLACE” statement will work here
  • “DROP + CREATE” is safe as these objects have no data/state
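For example, a view can be redeployed simply by re-running its full definition. A sketch (view, table, and column names are illustrative; some DBMSs require "DROP + CREATE" instead of "CREATE OR REPLACE"):

```sql
-- Redeploying a stateless object: the new definition simply replaces the old
CREATE OR REPLACE VIEW ACTIVE_ACCOUNTS AS
SELECT ACCOUNT_ID, ACCOUNT_NAME
FROM ACCOUNTS
WHERE STATUS = 'ACTIVE';
```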

We also consider static data files (i.e. code or reference data tables) as stateless objects. Though it involves table data, the data is fully defined in your scriptlet and deployable to the table, whether via bulk “delete + insert” or selective “insert + update + delete” operations.
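A static data scriptlet, for instance, might hold the full content of a reference table and be redeployed via the bulk approach (table name and values are illustrative):

```sql
-- Bulk "delete + insert": wipe and reload the full reference data set
DELETE FROM COUNTRY_CODE;
INSERT INTO COUNTRY_CODE (CODE, NAME) VALUES ('US', 'United States');
INSERT INTO COUNTRY_CODE (CODE, NAME) VALUES ('GB', 'United Kingdom');
```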

DB Deploy Tool Principles

Martin Fowler's article on evolutionary database design is an excellent read on the key principles that most source-control-based DB deploy tools follow; the following points are specifically relevant.

1) All database artifacts are version controlled with application code (versus managing from a UI)

UI-based management may be acceptable for less-technical users, but we recommend development teams manage their database scriptlets in source control (as they should with the rest of their application estate) and invoke deployments in an automated manner.

2) Explicitly coded incremental migrations are preferred for stateful objects (versus auto-calculated migrations)

Auto-calculating migrations, for example between the current database table state and the complete view in code, is risky in our opinion in an enterprise environment.

Database Deployment Tool Requirements

We evaluate database deployment tools based on how they handle these requirements:

A) Deploy incremental changes to an existing database

This is the primary functionality of a database deploy tool; most production deployments are executed like this. This methodology is also used in some non-production environments, particularly for testing deployments in a QA environment prior to production release.

B) Deploy a full schema to a blank database

Developers may want to deploy to an empty sandbox database for:

  1. Verifying that your SQL scriptlets actually deploy
  2. Running system integration tests involving your database (e.g. testing a new column or stored procedure)
  3. Testing database access code in your unit tests while leveraging in-memory databases

This is doable in a couple ways:

  1. By replaying all migration scriptlets from the beginning, assuming that prior scriptlets have been preserved in your package
  2. By re-baselining the scriptlets such that they can deploy to a blank database, while still allowing subsequent incremental production deployments to occur.

C) Be maintainable and readable

Prior to our database deployment overhaul, we saw a few teams maintain a file per database object containing the object's definition, even though those files were not used for deployments.

These files seemed to serve no purpose, yet they gave us some insights:

  • Developers like to have a visual in code of their database objects and structure
  • ORM tools like Hibernate and Reladomo generate DDLs that show such a structure, and it is a bonus to tie these back to your database instance

General Database Deployment Tool Design

Main Deploy Algorithm

Given the principles above, most source-control-based DB Deploy tools (including Obevo) work as follows:

  1. Developers write scriptlets for their next release, adding to previously deployed scriptlets already in source control.
  2. Scriptlets are tested and built into a package
  3. The package is deployed against the target database
    1. Simpler tools require deployers to specify the scriptlets to deploy
    2. More advanced tools determine which scriptlets need deployment by comparing against a deploylog table, per the figure below
      1. This allows the same package and deploy command to be used against any environment, regardless of the package version previously deployed on it
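The deploy-log comparison in step 3 can be sketched as a simple set difference (function, variable, and scriptlet names here are illustrative, not Obevo's actual API):

```python
# Sketch of the deploy-log comparison: only scriptlets not yet recorded in
# the target environment's deploy log are selected for deployment.

def calculate_changeset(package_scriptlets, deploy_log):
    """package_scriptlets: ordered (name, sql) pairs built into the package.
    deploy_log: names already recorded as deployed in the target database."""
    return [name for name, _sql in package_scriptlets if name not in deploy_log]

package = [
    ("MyTable.init", "CREATE TABLE ..."),
    ("MyTable.addCol", "ALTER TABLE ..."),
    ("MyView", "CREATE VIEW ..."),
]

# A fresh environment deploys everything; an environment that already has
# MyTable.init recorded deploys only the remaining scriptlets.
print(calculate_changeset(package, set()))
print(calculate_changeset(package, {"MyTable.init"}))
```

This is what allows the same package and deploy command to work against any environment: the changeset is derived from the target's deploy log, not from the package version.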

Stateful and Stateless Object Deployment Semantics

The object type has an impact on the changeset calculation semantics.

  • Stateless objects allow scriptlet additions, removals, and updates: the object definition (assuming it is valid) can replace the existing definition in the database without losing data.
  • Stateful objects, however, only typically allow adding scriptlets: updating already-deployed scriptlets could cause the object to be formed differently than intended.

To demonstrate the stateful use case, suppose we deploy package version 1, whose scriptlets (M1 through M3) build the table, with M3 adding column C.

Now let’s say someone modified M3 to rename the column, and then we re-deploy. What do we expect?

The tool detects a mismatch:

  • The scriptlet M3 has changed and wants to add column C123456.
  • But the database already has column C deployed.
  • The source scriptlets no longer include column C, yet have no way to remove it from the database.

Thus, the general rule: stateful object scriptlets cannot be modified once they are deployed.

Certain opt-in functionalities let us work around this if needed, such as:

  • Rollback: providing an explicit rollback scriptlet to use when undeploying
  • Re-baselining: rewriting already-deployed scriptlets into a more concise scriptlet without attempting to redeploy

Differentiating Implementation Choices

Given that their underlying algorithms are similar, deployment tools vary based on a couple of implementation points.

1) How to organize scriptlets into files

Options include grouping scriptlets together by:

  • Release
  • The object modified
  • Not grouping scriptlets at all, keeping each migration in its own file

2) How to order scriptlets for deployment

Options to consider:

  • A separate file that lists the precise migration order
  • A file naming convention that determines the order
  • Dependency analysis on the scriptlets that implies the order

Next, we will detail how Obevo addressed these two points.

Obevo Design: Object-Based Scriptlet Organization

Our main database deployment problems were around managing the large number of objects in a schema: to develop, to maintain, and to deploy them. We also had developers actively working on their database objects in conjunction with coding their applications.

Hence, we wanted to provide an experience that would be relatable for developers, which led us to organize our scriptlets based on the object name. We will dive into those details in this section.

(This structure adds some challenges around ordering, which we will cover in the next section of this article.)

Project Structure Basics

We organize scriptlets based on the objects to which they apply, per the following example.

The file structure differs based on whether the object is stateful or stateless.

  • Stateless objects can just store the definition itself in the file, as their full definition can be deployed against the database
  • Stateful objects, however, need to be deployed using incremental scriptlets.
    • Hence, multiple scriptlets would be needed to bring an object to its current state, and we keep all those in the same file
    • We split the file into multiple scriptlets by demarcating each section with a line beginning with “//// CHANGE”
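A stateful object file might thus look like the following sketch (the change names and `name=` attribute follow the `//// CHANGE` convention described above; the table and columns are illustrative):

```sql
//// CHANGE name=init
CREATE TABLE MyTable (
    ID   INT NOT NULL,
    NAME VARCHAR(100)
)

//// CHANGE name=addCreatedAt
ALTER TABLE MyTable ADD COLUMN CREATED_AT TIMESTAMP
```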
 

                                  Stateful Object Representation    Stateless Object Representation

Scriptlet-to-file cardinality     1-to-many scriptlets per file     1 scriptlet per file

Scriptlet naming convention       <objectName>.<changeName>         <objectName>

Analysis: Stateless Object Handling

The object-based structure is convenient for stateless objects as the full stateless object definition can be maintained in a file and modified easily in place.

For comparison, it is technically possible to handle stateless object deployments in an incremental stateful manner, e.g. as incremental scriptlets that persist over multiple releases. However, this leads to redundancy in database scriptlets as objects change over multiple releases.

Analysis: Readability

This project structure gives us a few advantages from a maintenance perspective:

The database structure is very readable from the project structure.

  • To make or review a change on an object, it is clear which file to look for.
  • Writing or reviewing changes can be done while easily viewing the object definition in the same place, instead of viewing the object definition elsewhere (i.e. in other files or the database itself).
  • Though scriptlets can accumulate within a stateful object file, this can be mitigated with the re-baseline feature, which merges multiple scriptlets into one without executing any deployments.

For a comparison point, take an example project structure where a file was associated with a migration or release, as many tools support. This can lead to a few issues:

  • Lack of object readability: if an object is modified over many releases, its structure will either be dispersed across multiple files (for stateful objects) or simply redundant across many files (as demonstrated for stateless objects previously).
  • Accumulation of unreadable and unwritable files: unreadable per the previous point, unwritable per the rule against modifying deployed stateful object scriptlets.
  • While re-baselining could reduce the file count, it would have to be done across a full schema instead of on a per-object basis. Compared to an object-based project structure:
    • The re-baselining effort would be much greater
    • The resultant re-baselined file would be much larger and more complex to read through

From a code review / release review perspective: our object-based structure meant all the changes in a particular release would be dispersed across files. At first blush, it may seem difficult to review the scriptlets to be deployed for the next release. However, we can still review code for the release by comparing VCS history and tags - the same way we would do for application code.

Analysis: Developer Benefits

Developers get a couple benefits with the Obevo project structure.

As scriptlets for an object are colocated in a single file, we can easily deploy individual objects in tests, which proves useful for cases like unit testing a data access API in an in-memory database.

Developers can also leverage ORM tools to generate DDLs from their application classes and reconcile them against their migration scriptlets. For brevity, we will not delve into this here, but you can read more in our documentation.


Obevo Design: Ordering via Dependency Analysis

While choosing an object-based project organization gave many benefits detailed in the last section, it complicated the ordering problem.

Objects can depend on each other, and as we scale up to hundreds and thousands of objects in a schema, it becomes increasingly difficult to manually prescribe an order.

Let’s describe how we overcame these challenges.

Ordering Algorithm

Not all scriptlets depend on each other, and so we have some flexibility on our ordering within the obvious constraints declared by object dependencies. Hence, we devise a solution using a simple graph algorithm.

Take the following example statements:

  • 3 to create tables
  • 1 to establish a foreign key
  • 1 to create a view:
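Concretely, the five statements might look like this (TABLE_A, TABLE_B, and VIEW1 come from the example; the third table's name and the column details are illustrative):

```sql
CREATE TABLE TABLE_A (A_ID INT NOT NULL PRIMARY KEY);
CREATE TABLE TABLE_B (B_ID INT NOT NULL, A_ID INT);
CREATE TABLE TABLE_C (C_ID INT NOT NULL);

ALTER TABLE TABLE_B ADD FOREIGN KEY (A_ID) REFERENCES TABLE_A (A_ID);

CREATE VIEW VIEW1 AS SELECT A_ID FROM TABLE_A;
```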

Note the following:

  • The order in which we create tables does not matter
  • The foreign key must be created after TABLE_A and TABLE_B
  • VIEW1 must be created after TABLE_A

This lends itself to a directed graph representation, where the graph nodes are the scriptlets and the edges are the ordering dependencies.

We can now use the topological sort algorithm to come up with an acceptable ordering that maintains these ordering constraints and will deploy our database successfully.

The topological sort can emit many acceptable orderings, but we tune our algorithm usage to give a single consistent ordering so that we have consistent behavior across environments.
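A minimal sketch of this ordering in Python, using the standard library's graphlib (the alphabetical tie-break here stands in for whatever deterministic ordering rule the tool applies; TABLE_C's name is illustrative):

```python
from graphlib import TopologicalSorter

# Nodes are scriptlets; each maps to the scriptlets it depends on.
# Dependencies mirror the example: the foreign key needs both tables,
# and VIEW1 needs TABLE_A.
dependencies = {
    "TABLE_A": set(),
    "TABLE_B": set(),
    "TABLE_C": set(),
    "TABLE_B.fkA": {"TABLE_A", "TABLE_B"},
    "VIEW1": {"TABLE_A"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
order = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # deterministic tie-break => one consistent order
    order.extend(ready)
    sorter.done(*ready)

print(order)
# → ['TABLE_A', 'TABLE_B', 'TABLE_C', 'TABLE_B.fkA', 'VIEW1']
```

Any order satisfying the constraints would deploy successfully; sorting the ready set merely guarantees the same order in every environment.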

Now for the last detail: how do we declare the dependencies in our scriptlets?

Dependency declaration and discovery

We found the simplest way is to declare the dependencies in your scriptlet. See the dependencies attribute in the TABLE_B.fkA scriptlet below.
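Such a scriptlet might look like the following sketch (the SQL and the attribute value format are illustrative; the `dependencies` attribute on the `//// CHANGE` line is the explicit declaration referred to above):

```sql
//// CHANGE name=fkA dependencies=TABLE_A
ALTER TABLE TABLE_B ADD FOREIGN KEY (A_ID) REFERENCES TABLE_A (A_ID)
```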

However, this is not developer-friendly for large databases (imagine annotating hundreds or thousands of objects), so we need a way to automatically detect dependencies while still allowing developer-overrides as needed.

We use two strategies to infer dependencies:

Intra-object dependencies for stateful migrations:

We allow stateful objects to define multiple scriptlets. It is natural to imply that the migrations within a file are deployed in the same order they are written, and so we infer such dependencies in our graph.

Cross-object dependencies via text search

To detect cross-object dependencies, we need to search the scriptlet content to find dependent objects.

The technically ideal way would be to parse the SQL syntax to find those objects. However, this is very non-trivial as we’d have to understand the SQL syntax for all the DBMS types that we support.

Instead, Obevo takes a simple and naive approach: select the object names in your project that are found via string-search and assume those are the dependencies.

Notes on the implementation:

  • The available object names can be found by simply listing the files in your project
  • There are various ways to search for the object names in the scriptlet. Our current implementation breaks up the scriptlet into tokens via whitespace, then checks if the token exists in the set of object names
  • There are more nuances in the algorithm, but the above will suffice for this article
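The core of the approach can be sketched in a few lines (the function name and the detail of excluding the object's own name are our illustration, not Obevo's exact implementation):

```python
def infer_dependencies(object_name, scriptlet_text, all_object_names):
    """Return the project objects mentioned in the scriptlet, excluding itself."""
    tokens = scriptlet_text.split()  # naive whitespace tokenization
    return {t for t in tokens if t in all_object_names and t != object_name}

# Object names come from listing the files in the project
project_objects = {"TABLE_A", "TABLE_B", "TABLE_B.fkA", "VIEW1"}

view_sql = "CREATE VIEW VIEW1 AS SELECT A_ID FROM TABLE_A"
print(sorted(infer_dependencies("VIEW1", view_sql, project_objects)))
# → ['TABLE_A']
```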

Applied to our previous example, the algorithm correctly infers that TABLE_B.fkA depends on TABLE_A and TABLE_B, and that VIEW1 depends on TABLE_A.

In case of false positive matches (e.g. due to comments) or false negatives, developers can specify exclusion or inclusion overrides as needed.

At first glance, it may be hard to imagine this working for actual use cases, but we have used this technique successfully to deploy many complex schemas, some spanning thousands of objects such as tables, stored procedures, views, etc.

If you would like to see an example in practice, please see our kata lesson that goes through an example reverse engineering of a large database schema.

Handling Data Migrations Across Multiple Tables

We would like to quickly mention this use case (i.e. moving data from an old column to a new column, then dropping the old column), as at first it may seem more difficult to apply the object-based file structure concept to this.

Obevo can handle this; in short, we provide the concept of a “migration” object, which:

  1. Lets us define update scriptlets to facilitate these migrations
  2. Allows each object file to keep only the scriptlets relating to its own definition, thus preserving its ability to be deployed individually for testing
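As a sketch, such a migration object would hold only the data-move step, while the column additions and drops stay in the tables' own files (all names and SQL here are illustrative):

```sql
//// CHANGE name=moveNotesToTableB
-- Copy data from the old column on TABLE_A into its new home on TABLE_B
UPDATE TABLE_B SET NOTES =
    (SELECT NOTES FROM TABLE_A WHERE TABLE_A.A_ID = TABLE_B.A_ID)
```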

For more information, please see the documentation.

Reverse-engineering of existing schemas

Hopefully we have shown that you can deploy very complex schemas with Obevo. But to truly allow existing systems to onboard to Obevo, we must make it easy for developers to reverse-engineer their existing databases.

This is not a trivial problem to solve, as unfortunately there is no single API that works perfectly across different DBMS types.

  • Java + JDBC provides the DatabaseMetaData API, but it is inconsistently implemented across different DBMSs
  • Some third-party tools try to bridge the gap, but do not cover all the detailed SQL syntax that vendors expose, or may lag in covering new features that DBMSs release

Hence, we chose to integrate with vendor-provided reverse-engineering tools (see the table below). Some tools simply emit the full schema to a single file, but we provide a utility that can convert such files into Obevo’s object-based structure, leveraging simple string-parsing and regular-expression techniques. We find this technique more reliable for existing systems, particularly since the vendor tools know best how to expose their own schemas, compared to generic Java APIs.

DBMS          Tooling Leveraged by Obevo
DB2           DB2LOOK
Postgres      pg_dump
SQL Server    Microsoft.SqlServer.Management.Smo.Scripter class in PowerShell
Oracle        DBMS_METADATA API via JDBC
Sybase ASE    ddlgen

Conclusion

While many open source tools exist to help with database deployment, we felt that our more complex use cases required more powerful tooling.

With Obevo, we aim to support all types of systems: whether enhancing the productivity of modern systems with our testing features and easy object-based maintenance, or reinvigorating long-lived systems by facilitating their onboarding to SDLC control where they previously had none.

There are more features and database deployment activities that we haven’t covered here (e.g. rollback, phased deployments, in-memory DB testing, multiple-schema management). Feel free to visit our GitHub page, documentation, and kata lessons to learn more about Obevo and how to apply the tool to your system.

We hope to write more articles about our practices and DB uplift experiences, so feel free to post comments and ask any questions you have.

About the Author

Shant Stepanian is a Senior Engineer in the Platforms business unit at Goldman Sachs. Shant is the chief architect of Obevo and led the database SDLC overhaul at Goldman. He has worked on a variety of system architectures, from batch-based reporting applications based on databases to high-throughput OLTP systems leveraging distributed and partitioned in-memory data grids.
