Key Takeaways
- Obevo is an enterprise-grade database deployment tool developed at Goldman Sachs and released in 2017 as an open source project under the Apache 2.0 License.
- Obevo allows database scripts to be organized per object, similar to application code, providing many advantages to developers.
- Obevo can help both new applications and systems with existing databases get their DB change management under SDLC control.
- Teams can get started quickly with Obevo’s onboarding tooling and guided examples.
- Additional features include rollback support, in-memory database testing, and phased deployments.
In recent years, Goldman Sachs adopted a standard software development lifecycle (SDLC) for application build and deployment. This included managing database schema definitions in source control for both new and existing systems, a more difficult endeavor than managing application code. In this article, we will describe how Obevo, our recently open-sourced DB Deployment utility, helped many of our enterprise applications get their databases under SDLC control.
The Database Deployment Problem Space at an Enterprise
Onboarding database definitions to a standard SDLC process is challenging for many reasons, notably the stateful nature of databases and the need to perform incremental migrations. Hence, many applications have not historically had an automated or smooth database deployment process. Our goal was to get database schemas managed under the same SDLC that applications go through: with all definitions committed to a version control system (VCS) and deployed via a standard build/release mechanism.
This effort was complicated by the diversity of use cases in our database estate:
- The modern: brand new schemas deploying tables, doing in-memory testing, and incorporating proper SDLC from the start
- The old: systems over ten years old that never had a controlled deployment process
- The complex: hundreds or thousands of objects of many types, such as tables, views, stored procedures, functions, static data uploads, and data migrations
- The time-consuming: tables with millions of data rows, and deployments taking hours
Regardless of the use case, the SDLC process itself is complicated by a large number of geographically distributed developers actively making changes.
While existing open source tools could do the job for simpler cases, they could not handle the scale and complexity of some of our existing systems. And we could not just leave these existing systems without a proper SDLC; they are critical systems with active development and releases.
Thus, we developed Obevo (available under the Apache 2.0 License), a tool to handle all such use cases. Obevo’s key differentiator is the ability to maintain DB objects per file (similar to how class definitions are typically stored per file) while still handling incremental deployments.
In this article, we will describe the DB Deployment problem space and then demonstrate how the object-based project structure helps us elegantly manage hundreds or thousands of schema objects for a variety of object and environment types.
Quick Obevo Feature List: Something for Everyone!

| For New Systems | For Existing and Complex Systems |
| --- | --- |
| Simple to maintain, review, and deploy | Easy to reverse-engineer and onboard |
| Selectively deploy tables for easier usage in tests | Stateful (tables) and stateless (views, SPs, functions, …) objects supported |
| In-memory test database conversion of most table scripts | Easy to support hundreds or thousands of objects |
DB Object Types (Stateful vs. Stateless)
First, let’s review the deployment semantics for different DB object types, as this impacts the tooling design.
Quick terminology note for this article: we use "scriptlet" to mean an individual unit of SQL code applied to the database (e.g. a single CREATE or ALTER statement), as distinct from the file that contains it.
Stateful objects (e.g. tables)
A stateful object requires incremental modifications to its definition, instead of a full definition replacement. Take the following example of adding two columns to MyTable.
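A minimal sketch of the before and after states (table and column definitions are hypothetical):

```sql
-- Current state: MyTable with two columns
CREATE TABLE MyTable (
    id   INT NOT NULL,
    name VARCHAR(100)
);

-- Desired end state: the same table with two additional columns
CREATE TABLE MyTable (
    id     INT NOT NULL,
    name   VARCHAR(100),
    price  DECIMAL(10, 2),
    status VARCHAR(20)
);
```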
Ideally, we could bring the database to the end state by specifying a SQL statement that defines a table with all four columns. Alas, SQL does not provide a workable solution for this:
- Dropping and recreating the table means losing your data
- Saving the data in a temporary table could be an expensive and complex operation
Instead, relational database management systems (RDBMS) allow users to specify ALTER statements to modify an existing table.
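Continuing the hypothetical example above, the incremental migration would look like this (exact ALTER syntax varies slightly by DBMS):

```sql
-- Add the two new columns in place, preserving existing rows
ALTER TABLE MyTable ADD price DECIMAL(10, 2);
ALTER TABLE MyTable ADD status VARCHAR(20);
```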
Some column updates may require data migrations, such as moving column data from one table to another.
Thus, each object is a result of applying multiple scriptlets; the initial scriptlet will create the object, and subsequent ones will modify it.
Stateless objects (e.g. views, stored procedures)
A stateless object, on the other hand, can be created and modified by specifying its full object definition.
- “DROP + CREATE” statements or a “CREATE OR REPLACE” statement will work here
- “DROP + CREATE” is safe as these objects have no data/state
We also consider static data files (i.e. code or reference data tables) as stateless objects. Though it involves table data, the data is fully defined in your scriptlet and deployable to the table, whether via bulk “delete + insert” or selective “insert + update + delete” operations.
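For example, a static data scriptlet for a hypothetical country-code table could simply enumerate the full intended contents, which the tool reconciles against the live table on each deploy:

```sql
-- Hypothetical code table contents; fully specified in source, so the deploy
-- can reconcile via bulk delete+insert or selective insert/update/delete
INSERT INTO COUNTRY_CODE (code, description) VALUES ('US', 'United States');
INSERT INTO COUNTRY_CODE (code, description) VALUES ('GB', 'United Kingdom');
INSERT INTO COUNTRY_CODE (code, description) VALUES ('JP', 'Japan');
```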
DB Deploy Tool Principles
This article from Martin Fowler is an excellent read on the key principles that most source-control-based DB deploy tools follow; the following points are specifically relevant.
1) All database artifacts are version controlled with application code (versus managing from a UI)
UI-based management may be acceptable for less-technical users, but we recommend development teams manage their database scriptlets in source control (as they should with the rest of their application estate) and invoke deployments in an automated manner.
2) Explicitly coded incremental migrations are preferred for stateful objects (versus auto-calculated migrations)
Auto-calculating migrations, for example by diffing the current database table state against the complete definition in code, is in our opinion risky in an enterprise environment.
Database Deployment Tool Requirements
We evaluate database deployment tools based on how they handle these requirements:
A) Deploy incremental changes to an existing database
This is the primary functionality of a database deploy tool; most production deployments are executed like this. This methodology is also used in some non-production environments, particularly for testing deployments in a QA environment prior to production release.
B) Deploy a full schema to a blank database
Developers may want to deploy to an empty sandbox database for:
- Verifying that your SQL scriptlets actually deploy
- Running system integration tests involving your database (e.g. testing a new column or stored procedure)
- Testing database access code in your unit tests while leveraging in-memory databases
This is doable in a couple ways:
- By replaying all migration scriptlets from the beginning, assuming that prior scriptlets have been preserved in your package
- By re-baselining the scriptlets such that they can deploy to a blank database, while still allowing subsequent incremental production deployments to occur.
C) Be maintainable and readable
Prior to our database deployment overhaul, we saw a few teams that maintained a file per database object containing the object’s definition, even though those files were not used for deployments.
At first this seemed to serve no purpose, but we gleaned some insights:
- Developers like to have a visual in code of their database objects and structure
- ORM tools like Hibernate and Reladomo generate DDLs that show such a structure, and it is a bonus to tie these back to your database instance
General Database Deployment Tool Design
Main Deploy Algorithm
Given the principles above, most source-control-based DB Deploy tools (including Obevo) work as follows:
- Developers write scriptlets for their next release, adding to previously deployed scriptlets already in source control.
- Scriptlets are tested and built into a package
- The package is deployed against the target database
- Simpler tools require deployers to specify the scriptlets to deploy
- More advanced tools determine which scriptlets need deployment by comparing the package against a deploy-log table maintained in the target database, as sketched below
- This allows the same package and deploy command to be used against any environment, regardless of the package version previously deployed on it
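A minimal sketch of the deploy-log idea (table and column names are illustrative, not Obevo’s actual audit schema):

```sql
-- One row per scriptlet that has already been executed against this environment
CREATE TABLE DEPLOY_LOG (
    object_name VARCHAR(128) NOT NULL,
    change_name VARCHAR(128) NOT NULL,
    deployed_at TIMESTAMP    NOT NULL,
    PRIMARY KEY (object_name, change_name)
);

-- At deploy time, the tool executes only the scriptlets in the package that
-- have no matching row here, then inserts rows for them upon success.
```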
Stateful and Stateless Object Deployment Semantics
The object type has an impact on the changeset calculation semantics.
- Stateless objects allow scriptlet additions, removals, and updates: the object definition (assuming it is valid) can replace the existing definition in the database without losing data.
- Stateful objects, however, only typically allow adding scriptlets: updating already-deployed scriptlets could cause the object to be formed differently than intended.
To demonstrate the stateful use case, let’s deploy package version 1, which contains three scriptlets (M1, M2, M3) for a single table.
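A sketch of those scriptlets (column names and types are hypothetical):

```sql
-- M1: create the table
CREATE TABLE MyTable (
    A INT NOT NULL
);

-- M2: add column B
ALTER TABLE MyTable ADD B VARCHAR(50);

-- M3: add column C
ALTER TABLE MyTable ADD C VARCHAR(50);
```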
Now let’s say someone modified M3 to rename column C to C123456, and then we re-deploy. What do we expect?
The tool detects a mismatch:
- The scriptlet M3 has changed and wants to add column C123456.
- But the database already has column C deployed.
- The source scriptlets no longer include column C, yet have no way to remove it from the database.
Thus, the general rule: stateful object scriptlets cannot be modified once they are deployed.
Certain opt-in functionalities let us work around this if needed, such as:
- Rollback: providing an explicit rollback scriptlet to use when undeploying
- Re-baselining: rewriting already-deployed scriptlets into a more concise scriptlet without attempting to redeploy
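For instance, re-baselining could collapse the M1 through M3 scriptlets from the earlier example into a single equivalent statement; this is only a sketch of the end result, as the actual mechanics are described in the Obevo docs:

```sql
-- Re-baselined scriptlet: the combined end state of M1 + M2 + M3. The tool
-- records it as already deployed rather than executing it against existing
-- environments; only fresh environments run it directly.
CREATE TABLE MyTable (
    A INT NOT NULL,
    B VARCHAR(50),
    C VARCHAR(50)
);
```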
Differentiating Implementation Choices
Given that their underlying algorithms are similar, deployment tools differ mainly on a couple of implementation points.
1) How to organize scriptlets into files
Options include grouping scriptlets together by:
- Release
- The object modified
- Not grouping scriptlets at all, keeping each individual migration in its own file
2) How to order scriptlets for deployment
Options to consider:
- A separate file that lists the precise migration order
- A file naming convention that determines the order
- Dependency analysis on the scriptlets that implies the order
Next, we will detail how Obevo addressed these two points.
Obevo Design: Object-Based Scriptlet Organization
Our main database deployment problems centered on managing the large number of objects in a schema: developing, maintaining, and deploying them. We also had developers actively working on their database objects in conjunction with coding their applications.
Hence, we wanted to provide an experience that would be relatable for developers, which led us to organize our scriptlets based on the object name. We will dive into those details in this section.
(This structure adds some challenges around ordering, which we will cover in the next section of this article.)
Project Structure Basics
We organize scriptlets based on the objects to which those scriptlets apply, per the following example.
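A sketch of such a layout (directory and object names are illustrative):

```
MYSCHEMA/
    table/
        TABLE_A.sql
        TABLE_B.sql
    view/
        VIEW1.sql
    sp/
        SP1.sql
    staticdata/
        COUNTRY_CODE.sql
```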
The file structure differs based on whether the object is stateful or stateless.
- Stateless objects can just store the definition itself in the file, as their full definition can be deployed against the database
- Stateful objects, however, need to be deployed using incremental scriptlets.
- Hence, multiple scriptlets would be needed to bring an object to its current state, and we keep all those in the same file
- We split the file into multiple scriptlets by demarcating each section with a line beginning with “//// CHANGE”
|  | Stateful Object Representation | Stateless Object Representation |
| --- | --- | --- |
| Scriptlet-to-file cardinality | 1-to-many scriptlets per file | 1 scriptlet per file |
| Scriptlet naming convention | \<objectName\>.\<changeName\> | \<objectName\> |
| Example | TABLE_A.sql (below) | VIEW1.sql (below) |
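For illustration, here is what the two kinds of files might contain (object definitions are hypothetical; the “//// CHANGE” line is the section delimiter described above):

```sql
-- TABLE_A.sql: a stateful object, holding multiple scriptlets in one file

//// CHANGE name=init
CREATE TABLE TABLE_A (
    a_id INT NOT NULL,
    PRIMARY KEY (a_id)
);

//// CHANGE name=addDescColumn
ALTER TABLE TABLE_A ADD a_desc VARCHAR(100);
```

```sql
-- VIEW1.sql: a stateless object, holding only its full definition,
-- which is modified in place as the view evolves
CREATE VIEW VIEW1 AS SELECT a_id, a_desc FROM TABLE_A;
```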
Analysis: Stateless Object Handling
The object-based structure is convenient for stateless objects as the full stateless object definition can be maintained in a file and modified easily in place.
For comparison, it is technically possible to handle stateless object deployments in an incremental, stateful manner, e.g. as incremental scriptlets that persist over multiple releases. However, this leads to redundant copies of the object definition accumulating in your scriptlets as objects change across releases.
Analysis: Readability
This project structure gives us a few advantages from a maintenance perspective:
The database structure is very readable from the project structure.
- To make or review a change on an object, it is clear which file to look for.
- Writing or reviewing changes can be done while easily viewing the object definition in the same place, instead of viewing the object definition elsewhere (i.e. in other files or the database itself).
- Though scriptlets accumulate within a stateful object file over time, this can be mitigated with the re-baseline feature, which merges multiple scriptlets into one without executing any deployments.
For a comparison point, take an example project structure where a file is associated with a migration or release, as many tools support. This can lead to a few issues:
- Lack of object readability: if an object is modified over many releases, its structure will either be dispersed across multiple files (for stateful objects) or simply redundant in many files (as demonstrated for stateless objects previously).
- Accumulation of unreadable and unwritable objects: unreadable due to the previous point, unwritable per the rule of not modifying stateful object scriptlets.
- While re-baselining could reduce the file count, it would have to be done across a full schema rather than per object. Compared to an object-based project structure:
  - The re-baselining effort would be much greater
  - The resultant re-baselined file would be much larger and more complex to read through
From a code review / release review perspective, our object-based structure means the changes in a particular release are dispersed across files. At first blush, it may seem difficult to review the scriptlets to be deployed for the next release. However, we can still review code for the release by comparing VCS history and tags, the same way we would for application code.
Analysis: Developer Benefits
Developers get a couple benefits with the Obevo project structure.
As scriptlets for an object are colocated in a single file, we can easily deploy individual objects in tests, which proves useful for cases like unit testing a data access API in an in-memory database.
Developers can also leverage ORM tools to generate DDLs from their application classes and reconcile them against their migration scriptlets. For brevity, we will not delve into this here, but you can read more in our documentation.
Obevo Design: Ordering via Dependency Analysis
While choosing an object-based project organization gave many benefits detailed in the last section, it complicated the ordering problem.
Objects can depend on each other, and as we scale up to hundreds or thousands of objects in a schema, it becomes increasingly difficult to manually prescribe an order.
Let’s describe how we overcame these challenges.
Ordering Algorithm
Not all scriptlets depend on each other, so we have some flexibility in our ordering within the constraints declared by object dependencies. Hence, we devised a solution using a simple graph algorithm.
Take the following example statements:
- 3 to create tables
- 1 to establish a foreign key
- 1 to create a view:
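A sketch of those statements (column definitions are illustrative; the third table’s name is hypothetical):

```sql
CREATE TABLE TABLE_A (a_id INT NOT NULL, PRIMARY KEY (a_id));
CREATE TABLE TABLE_B (b_id INT NOT NULL, a_id INT NOT NULL);
CREATE TABLE TABLE_C (c_id INT NOT NULL);

-- Foreign key linking TABLE_B to TABLE_A
ALTER TABLE TABLE_B ADD FOREIGN KEY (a_id) REFERENCES TABLE_A (a_id);

-- View over TABLE_A
CREATE VIEW VIEW1 AS SELECT a_id FROM TABLE_A;
```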
Note the following:
- The order in which we create tables does not matter
- The foreign key must be created after TABLE_A and TABLE_B
- VIEW1 must be created after TABLE_A
This lends itself to a directed graph representation, where the graph nodes are the scriptlets and the edges are the ordering dependencies.
We can now use the topological sort algorithm to come up with an acceptable ordering that maintains these ordering constraints and will deploy our database successfully.
The topological sort can emit many acceptable orderings, but we tune our algorithm usage to give a single consistent ordering so that we have consistent behavior across environments.
Now for the last detail: how do we declare the dependencies in our scriptlets?
Dependency declaration and discovery
We found the simplest way is to declare the dependencies in your scriptlet. See the dependencies attribute in the TABLE_B.fkA scriptlet below.
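A sketch of that scriptlet (hypothetical column names; see the docs for the exact attribute syntax):

```sql
-- TABLE_B.sql (excerpt): the fkA change declares its dependency on TABLE_A,
-- so the graph orders it after TABLE_A's creation
//// CHANGE name=fkA dependencies="TABLE_A"
ALTER TABLE TABLE_B ADD FOREIGN KEY (a_id) REFERENCES TABLE_A (a_id);
```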
However, this is not developer-friendly for large databases (imagine annotating hundreds or thousands of objects), so we need a way to automatically detect dependencies while still allowing developer overrides as needed.
We use two strategies to infer dependencies:
Intra-object dependencies for stateful migrations:
We allow stateful objects to define multiple scriptlets in one file. It is natural to assume that the migrations within a file are deployed in the order they are written, and so we infer such dependencies in our graph.
Cross-object dependencies via text search
To detect cross-object dependencies, we need to search the scriptlet content to find dependent objects.
The technically ideal way would be to parse the SQL syntax to find those objects. However, this is very non-trivial as we’d have to understand the SQL syntax for all the DBMS types that we support.
Instead, Obevo takes a simple, naive approach: string-search the scriptlet content for the names of other objects in the project, and assume any matches are dependencies.
Notes on the implementation:
- The available object names can be found by simply listing the files in your project
- There are various ways to search for the object names in the scriptlet. Our current implementation breaks up the scriptlet into tokens via whitespace, then checks if the token exists in the set of object names
- There are more nuances in the algorithm, but the above will suffice for this article
Here are the algorithm results for our previous example:
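Applied to the earlier example, the string search would infer:
- TABLE_B.fkA depends on TABLE_A (its text contains the token TABLE_A)
- VIEW1 depends on TABLE_A (its definition selects from TABLE_A)
- The three CREATE TABLE scriptlets have no cross-object dependencies among themselves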
In case of false positive matches (e.g. due to comments) or false negatives, developers can specify exclusion or inclusion overrides as needed.
At first glance, it may be hard to imagine this working for actual use cases, but we have used this technique successfully to deploy many complex schemas, some spanning thousands of objects such as tables, stored procedures, views, etc.
If you would like to see an example in practice, please see our kata lesson that goes through an example reverse engineering of a large database schema.
Handling Data Migrations Across Multiple Tables
We would like to quickly mention this use case (i.e. moving data from an old column to a new column, then dropping the old column), as at first it may seem difficult to apply the object-based file structure concept to it.
Obevo can handle this. In short, we provide the concept of a “migration” object, which:
- Lets us define update scriptlets to facilitate these migrations
- Allows each object file to keep only the scriptlets relating to its own definition, thus preserving its ability to be deployed individually for testing (a sketch follows this list)
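A sketch of how this might look, with hypothetical names (the exact migration-object syntax is in the docs): the table’s own file carries the ADD COLUMN / DROP COLUMN scriptlets, while the migration object carries only the data move:

```sql
-- MyTableColMigration.sql: hypothetical migration object holding only the
-- data move; the ADD COLUMN / DROP COLUMN scriptlets stay in MyTable's own file
UPDATE MyTable SET new_col = old_col;
```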
For more information, please see the documentation.
Reverse-engineering of existing schemas
Hopefully we have shown that you can deploy very complex schemas with Obevo. But to truly allow existing systems to onboard to Obevo, we must make it easy for developers to reverse-engineer their existing databases.
This is not a trivial problem to solve, as unfortunately there is no single API that works perfectly across different DBMS types.
- Java + JDBC provides the DatabaseMetaData API, but it is inconsistently implemented across different DBMSs
- Some third-party tools try to bridge the gap, but they do not cover all the detailed SQL syntax that vendors expose, and may lag in covering new features that DBMSs release
Hence, we chose to integrate with vendor-provided reverse-engineering tools (see table below). Some tools simply emit the full schema to a single file, but we provide a utility that can convert such files into Obevo’s object-based structure, leveraging simple string-parsing and regular expression techniques. We find this technique more reliable for existing systems, particularly since the vendors’ own tools know best how to expose their schemas, compared to Java APIs.
| DBMS | Tooling Leveraged by Obevo |
| --- | --- |
| DB2 | db2look utility |
| Postgres | pg_dump utility |
| SQL Server | Microsoft.SqlServer.Management.Smo.Scripter class in PowerShell |
| Oracle | DBMS_METADATA API |
| Sybase ASE | ddlgen utility |
Conclusion
While many open source tools exist to help with database deployment, we felt that our more complex use cases required more powerful tooling.
With Obevo, we aim to support all types of systems: whether enhancing the productivity of modern systems with our testing features and easy object-based maintenance, or reinvigorating long-lived systems by facilitating their onboarding to SDLC control where they previously had none.
There are more features and database deployment activities that we haven’t covered here (e.g. rollback, phased deployments, in-memory DB testing, multiple-schema management). Feel free to visit our GitHub page, documentation, and kata lessons to learn more about Obevo and how to apply it to your system.
We hope to write more articles about our practices and DB uplift experiences, so feel free to post comments and ask any questions you have.
About the Author
Shant Stepanian is a Senior Engineer in the Platforms business unit at Goldman Sachs. Shant is the chief architect of Obevo and led the database SDLC overhaul at Goldman. He has worked on a variety of system architectures, from batch-based reporting applications based on databases to high-throughput OLTP systems leveraging distributed and partitioned in-memory data grids.
Community comments
How does this differ from liquibase?
by Dimitrov Dimitar,
By reading this article, the approach taken by Obevo looks quite similar to Liquibase. Could anybody highlight the important differences?
Re: How does this differ from liquibase?
by Shant Stepanian,
The key difference is that Liquibase requires developers to specify an explicit total ordering of all DB changes, whereas Obevo does not.
Note the Liquibase databaseChangeLog doc (changes are executed “in the order they appear in the databaseChangeLog file”). While this is manageable for a smaller number of changes, it did not scale well to the hundreds or thousands of database objects we frequently encountered. Consider having to onboard an existing system with that many objects; your likely options are either A) define all your objects in a single file (not easy to create, and not easy to edit and view objects afterward) or B) maintain each object in a separate file (which requires the difficult task of creating a databaseChangeLog file with a correct order).
Obevo’s ability to infer the order without needing such a change ordering file (see “Ordering via Dependency Analysis” above) makes it much easier for existing systems to onboard with a maintainable source code structure. Such systems previously had no easy methodology to test their existing objects or test a full deployment in a new schema, and were resigned to only managing new incremental changes in their DB deploy process. We feel that our approach also makes sense for new systems to manage their database objects, as it would result in a more maintainable source code layout that matches their application code.
Re: How does this differ from liquibase?
by Shant Stepanian,
To be more succinct:
The “Object-Based Scriptlet Organization” (and its associated benefits as listed in the article) is Obevo’s differentiating feature from other tools. It is possible due to the approach described in “Ordering via Dependency Analysis”.
For a tangible example, see our sample project (with 500 tables, 1000 stored procedures, 100 views, and 10 user types), and compare having a single file with all DDLs in explicit order versus the Obevo object-based structure. See the kata home page if you’d like to try the sample.
Question on stated alternative approach
by Richard Kallay,
Hi Shant, I have a question from goldmansachs.github.io/obevo/overview.html
On that page it states
Can you expand upon what "maintain the latest version of the schema in your source" means?
Also, are there any examples you can share about what is "tricky" and requires human intervention regarding figuring out the differences?
Re: Question on stated alternative approach
by Shant Stepanian,
The quote you cited corresponds to the section "Stateful objects (e.g. tables)", notably this picture showing the table DDL before and after adding columns.
By "latest version of the schema in your source", I am referring to the ability to migrate from the "before" DDL to the "after" DDL by simply specifying the "after" DDL, instead of alter statements.
A couple things make it non-trivial:
1) Calculating the delta. Attempting that based on the raw SQL requires parsing the SQL, which we considered difficult as we’d have to account for various DBMS dialects and any DBMS-specific column information beyond the common name/type/nullability attributes (e.g. storage attributes).
We’d instead want some canonical way to represent the table definition that is easily parsable and diffable while still accounting for important DBMS-specific features. The createTable statement of Liquibase or an ORM model from a tool like Hibernate or Reladomo are examples of non-SQL representations that could potentially fit the bill.
2) How to apply the delta. Some migration types (e.g. adding a column) are easier to script than others (e.g. modifying a column type), and depending on the use case (e.g. DBMS or admin restriction) it may require recreating the table and migrating data over (i.e. create temp table with new DDL, copy data over, drop original table, rename temp table to original). Some teams may choose not to delegate defining such complex migrations to a program without getting the chance to review and edit it first, especially for large databases where data updates can take a long time and need to be scheduled. Also, assuming we went with a non-SQL way to represent the delta, we’d still have to consider how to add in any DBMS- or storage-specific attributes.
---
It is certainly not impossible to achieve such automation, especially in the right circumstances with the right controls. We have considered adding such support in Obevo. But again, we would not recommend it for all cases and would still support the stateful change type.
Hope this helps - let me know if you need more clarification.
Is there already a discussion forum?
by Juergen Donnerstag,
I've started migrating a mid-size application to get some first hands-on experience. Out of that trial I have some questions. Is there already a place (forum, group, slack or irc channel) for such a discussion?
thanks
Juergen
Re: Is there already a discussion forum?
by Shant Stepanian,
Hi, the best place for a discussion at the moment is to raise an issue on our GitHub page. We may consider other avenues in the future.
General Question about versioned stateless objects
by Michael Graham,
Shant, I apologize if this seems like a stupid question, but the concept of keeping versioned changelog copies for stateless objects seems inconsistent with a desire to follow a deployment pattern for standard application code. Wouldn't it be more logical that the user or pipeline performing the deploy has done a clone of the release version they want and should deploy the stateless version? I understand that keeping all the change migrations within the stateful objects is necessary to bring any starting base version up to the release, but that wouldn't be true for stateless objects. I shouldn't need to apply v1_procedureA.sql and v2_procedureA.sql. If I have designated the end state I want by the nature of cloning the release version I am thereby dictating I want v2 in final state. If you want v1 the pipeline / user should have cloned that release as they would have done with standard application code in my mind. This would keep only a single state version in source control with versioning managed by git.
Re: General Question about versioned stateless objects
by Shant Stepanian,
Hi Michael,
No worries, thanks for the question!
Your assessment is correct: we should not need to keep versioned changelog copies for stateless objects.
That is indeed what we were able to accomplish in Obevo, though in hindsight that point may not have been clear in the article. You can see this represented in the object examples in the “Ordering Algorithm” section; note how the VIEW1.sql file differs from the other files in not having the “//// CHANGE” sections that demarcate incremental changes.
To summarize - the dilemma that tools generally face is how to handle both stateful and stateless objects. Representing stateful changes gives you a clear idea of the order, but is not as nice for representing objects that can change in place. Stateless objects can be modified in place, but then it is difficult to determine the deployment order, especially as we deal with hundreds or thousands of objects. The “Obevo Design: Ordering via Dependency Analysis” section highlights what we did to solve both problems.
Hope this helps!