DevOps @ large investment bank
I have recently worked for a large investment bank where I was part of the architecture/development-tooling team and where I spent a lot of time trying to solve the increasingly harmful problems of software by introducing DevOps.
Let me start by giving you my view on this whole subject of DevOps (highly influenced by the excellent presentation of John Allspaw and Paul Hammond of Flickr):
- The business – whether that be the sponsor or the end-user – wants the software to change, to adapt it to the changing world it represents, and it wants this to happen fast (most of the time as fast as possible)
- At the same time, the business wants the existing IT services to remain stable or at least not disrupted from the introduction of changes
The problem with the traditional software delivery process (or the lack thereof) is that it is not well adapted to support these two requirements simultaneously. So companies have to choose between either delivering changes fast and ending up with a messy production environment or keeping a stable but outdated environment.
This doesn’t work very well. Most of the time they will still want both and therefore put pressure on the developers to deliver fast and on the ops guys to keep their infrastructure stable. It is no wonder that dev and ops will start fighting with each other to protect their objectives and as a result they will gradually start drifting away from each other, leaving the Head of IT somewhere stuck inside the gap that arises between the two departments.
This picture pretty much summarizes the position of the Head of IT:
You can already guess what will happen with the poor man when the business sends both horses into different directions.
The solution is to somehow redefine the software delivery process in order to enable it to support these two requirements – fast and stable – simultaneously.
But how exactly should we redefine it? Let us have a look at this question from the point of view of the three layers that make up the software delivery process: the process itself, the tooling to support it and the culture (i.e. the people who use it):
First of all the process should be logical and efficient. It’s funny how sometimes big improvements can be made by just drawing the high-level process on a piece of paper and removing the obvious inconsistencies.
The process should initially take into account the complexities of the particular context (like the company’s application landscape, its technologies, team structure, etc) but in a later stage it should be possible to adapt the context in order to improve the process wherever the effort is worth it (e.g. switch from a difficult to automate development technology to an easy to automate one).
Secondly, especially if the emphasis is on delivering the changes fast, the process should be automated where possible. This has the added benefit that the produced data has a higher level of confidence (and therefore will more easily be used by whoever has an interest) and that the executed workflows are consistent with one another and not dependent on the mood or fallibility of a human being.
Automation should also be non-intrusive with regard to human intervention. By this I mean that whenever a situation occurs that is not supported by the automation (e.g. because it is too new, too complex to automate and/or happens only rarely) it should be possible for humans to take over from there and, when the job is done, give control back to the automation. This ability doesn’t come for free but must explicitly be designed for.
And then there is the cultural part: everyone involved in the process should get at least a high-level understanding of the end-to-end flow and in particular a sufficient knowledge of their own box and all interfacing boxes.
It is also well-known that people resist change. It should therefore not come as a surprise that they will resist changes to the software delivery process, a process that has a huge impact on the way they work day-to-day.
Now back to reality. The solution that I implemented consisted of:
- The organization-wide alignment of the configuration and release management processes
- The automation of these processes in order to facilitate the work of the release management team
The biggest problems that existed in a traditional company were:
- The exponentially increasing need for release coordination, which was characterized by its manual nature, and the pressure it exerted on the deployment window
- The inconsistent, vague, incomplete and/or erroneous deployment instructions
- The problems surrounding configuration management: being too vague and not reliable due to the manual nature of keeping track of them
- The manual nature of testing happening at the end of a long process meaning that it must absorb all upstream planning issues - this eventually causes the removal of any not signed-off change requests
These problems were caused (or at least permitted) by the way the IT department was organized, more precisely:
- The heterogeneity of the application landscape, the development teams and the ops teams
- The strong focus on the integration of the business applications and the high degree of coupling between them
- The manual nature of configuration management, release management, acceptance testing, server provisioning and environment creation
- The low frequency of the releases
Let us have a look now at the first steps we took in getting these problems solved: bringing configuration management under control.
It was quite obvious to me that we should start by bringing configuration management under control as it is the core of the whole ecosystem. A lot of building blocks were already present; it was only a matter of gluing them together and filling the remaining gaps.
This first step consisted of three sub-steps:
- Getting a mutual agreement on the structure of configuration management
- The implementation of a configuration management system (CMS) and a software repository - or definitive media library (DML) in ITIL terms
- The integration with the existing change management tool, build/continuous integration tools and deployment scripts
We decided to create three levels of configuration items (CI's). At the top is the level of the business application. Business applications can be seen as the "units of service" that the IT department provides to the business. Below that is the level of the software component. A software component refers to a deployable software package, like a third-party application, a web application, a logical database or a set of related ETL flows, and consists of all the files that are needed to deploy the component. At the bottom is the level of the source code, which consists of the files that are needed to build the component.
Here is a high-level overview of the three levels of CI's:
And here is an overview of the three levels of configuration management for a sample application CloudStore:
CI's are not static. They must change over time as part of the implementation of new features and these changes must be tracked by versioning the CI's. So you could see a version number as a way of uniquely identifying a particular set of features that are contained within the CI.
Here is an example of how the application and its components are versioned following the implementation of a change request:
Once this structure was defined, the complete list of business applications was created and each application was statically linked to the components it was made up of as well as to the development team they were maintained by. Change requests that were created for a particular application could not be used anymore to change components that belonged to a different application. And deployment requests that were created for a particular application could not be used anymore to deploy components of a different application. In these cases a new change request or deployment request had to be created for this second application.
So the resulting conceptual model looked as follows:
Following the arrows: an application contains one or more software components. There can be multiple change requests per application and release, but only one deployment request and it is used to deploy one, some or all of the software components.
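The relationships described above can be sketched as a few Python data classes. This is only an illustration under assumptions: the class names, field names and the component names used in the usage note are hypothetical and do not reflect the actual schema of the CMS.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ComponentVersion:
    component: str  # name of the software component (hypothetical)
    version: str    # version number identifying a set of features


@dataclass
class Application:
    name: str
    dev_team: str
    components: set = field(default_factory=set)  # statically linked components


@dataclass
class DeploymentRequest:
    application: Application
    release: str
    steps: list = field(default_factory=list)

    def add(self, item: ComponentVersion) -> None:
        # A deployment request may only deploy components of its own
        # application; other applications need their own request.
        if item.component not in self.application.components:
            raise ValueError(
                f"{item.component} does not belong to {self.application.name}"
            )
        self.steps.append(item)
```

For example, a request for CloudStore would accept `ComponentVersion("cloudstore-web", "1.2.0")` if `cloudstore-web` is linked to CloudStore, and raise a `ValueError` for a component that belongs to another application.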
Source code is stored and managed by version control systems. This level is well documented and there is good support by tooling (git, svn, TFS, ...) so I will not further discuss it here.
On the level of the components, the files that represent the component (executables, libraries, config files, scripts, setup packages, ...) are typically created from their associated source code by build tools and physically stored in the DML. All relevant information about the component (the application it belongs to, the context in which it was developed and built, ...) is stored in the CMS.
The business application has no physical presence, it only exists as a piece of information in the CMS and is linked to the components it is made up from.
Here is an overview of the implementation view:
Once these rules were agreed, a lightweight tool was built to serve as both the CMS and DML. It was integrated with the build tools in such a way that it automatically received and stored the built files after each successful build of a particular component. By restricting the upload of files to exclusively the build tools (which in turn assure that the component has successfully passed the unit-tests and the deployment test to a development server) at least a minimum level of quality was assured. Additionally, once a particular version number of a component was uploaded, it was frozen. Attempts to upload a newer "version" of the components with a version number that was already present would fail.
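The two upload rules (only build tools may upload, and an uploaded version is frozen) could look roughly like the sketch below. It uses an in-memory dictionary instead of real storage, and the class, exception and uploader names are assumptions made for illustration, not the tool that was actually built.

```python
class VersionAlreadyExists(Exception):
    """Raised when an upload would overwrite a frozen component version."""


class SoftwareRepository:
    """In-memory stand-in for the DML."""

    def __init__(self, trusted_uploaders):
        self._trusted = set(trusted_uploaders)
        self._files = {}  # (component, version) -> bytes

    def upload(self, uploader, component, version, payload):
        # Only build tools may upload, so every stored package has
        # passed through the build and its quality checks.
        if uploader not in self._trusted:
            raise PermissionError(f"{uploader} is not allowed to upload")
        key = (component, version)
        # A version is frozen once uploaded: never overwrite it.
        if key in self._files:
            raise VersionAlreadyExists(f"{component} {version} is frozen")
        self._files[key] = payload

    def download(self, component, version):
        return self._files[(component, version)]
```

Rejecting a duplicate version number, rather than silently overwriting it, is what makes a version number a reliable identifier for everything downstream.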
The build tools not only sent the physical files, but also all the information that was relevant to the downstream processes: the person who built it, when it was built, the commit messages (a.k.a. check-in comments) since the previous version, the file diffs since the previous version, ...
With this information the CMS was able to calculate some interesting pieces of information that used to be hard to manually keep track of before, namely the team that is responsible for doing the deployment and the logical server group (e.g. "Java web DEV" or ".NET Citrix UAT") to deploy to. Both were functions of the technology, the environment and sometimes other parameters that were part of the received information. As these calculation rules were quite volatile, they were implemented in some simple scripts that could be modified on-the-fly by the administrator of the CMS whenever the rules changed.
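As a rough idea of what such calculation rules might look like, here is a small sketch. The technology keys, the team names and the rule that development teams deploy to DEV themselves are all assumptions for the sake of the example; only the two server group labels come from the text above.

```python
# Hypothetical rule tables, kept as plain data so an administrator
# could change them without touching the rest of the CMS.
SERVER_GROUP_RULES = {
    ("java-web", "DEV"): "Java web DEV",
    ("dotnet-citrix", "UAT"): ".NET Citrix UAT",
}

OPS_TEAM_BY_TECHNOLOGY = {
    "java-web": "Java ops team",
    "dotnet-citrix": "Windows ops team",
}


def server_group(technology, environment):
    """Return the logical server group a component is deployed to."""
    key = (technology, environment)
    if key not in SERVER_GROUP_RULES:
        raise LookupError(f"no server group rule for {key}")
    return SERVER_GROUP_RULES[key]


def deployment_team(technology, environment):
    """Return the team responsible for doing the deployment."""
    # Assumed rule: dev teams deploy to DEV, ops teams do the rest.
    if environment == "DEV":
        return "development team"
    return OPS_TEAM_BY_TECHNOLOGY[technology]
```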
The CMS also parsed the change request identifiers from the commit messages and retrieved the relevant details about it from the change management tool (another integration we have implemented): the summary, type, status and the release for which it was planned.
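Extracting the identifiers from the commit messages can be done with a simple regular expression. The sketch below assumes identifiers of the form "CR-" followed by digits; the real format depends on the change management tool and is not specified here.

```python
import re

# Assumed identifier format, e.g. "CR-1234".
CHANGE_REQUEST_ID = re.compile(r"\bCR-\d+\b")


def change_request_ids(commit_messages):
    """Return the distinct identifiers in order of first appearance."""
    seen = {}  # a dict keeps insertion order and removes duplicates
    for message in commit_messages:
        for identifier in CHANGE_REQUEST_ID.findall(message):
            seen.setdefault(identifier, None)
    return list(seen)
```

The returned identifiers would then be used to look up the summary, type, status and planned release in the change management tool.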
The presence of the core data that came from the build tools, combined with the calculated data and especially the data retrieved from the change management tool, transformed the initially quite boring CMS into a small intelligence center. It now became possible to see the components (and their versions) that implemented a particular change request or even those that implemented any change requests for a particular release. (All of this assumes of course that the commit messages correctly contain the identifiers of the change request.)
In the following mock up you can see how the core data of a component is extended with information from change management and operations:
It's also interesting to look at configuration management from a higher level, from the level of the application for example:
By default all the components of the selected application were shown and for each component all versions since the version that was currently deployed in production were listed in descending order. In addition, the implemented change requests were also mentioned. (Note that the mock up also shows which versions of the component were deployed in which environment; more on this later, in the part about the implementation of a release management tool.)
The CMS also contained logic to detect a couple of basic inconsistencies:
- A particular version of a component didn't implement any change requests (in that case it was still possible to manually associate a change request with the version)
- A component version implemented change requests that were planned for different releases (it would be quite difficult to decide when to deploy it, right?)
- A change request that was planned for a particular release was not implemented by any components
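The three checks listed above could be expressed along these lines. The input shapes (two dictionaries) and the wording of the findings are assumptions chosen to keep the sketch short.

```python
def find_inconsistencies(implemented_by_version, planned_release, release):
    """
    implemented_by_version: {(component, version): set of change request ids}
    planned_release: {change request id: release name}
    release: the release being checked for unimplemented change requests
    """
    findings = []
    implemented = set()
    for (component, version), crs in sorted(implemented_by_version.items()):
        implemented |= crs
        # Check 1: a version that implements no change request at all.
        if not crs:
            findings.append(f"{component} {version} implements no change request")
        # Check 2: a version whose change requests span several releases.
        releases = {planned_release[cr] for cr in crs if cr in planned_release}
        if len(releases) > 1:
            findings.append(f"{component} {version} spans releases {sorted(releases)}")
    # Check 3: a planned change request that no component implements.
    for cr, planned in sorted(planned_release.items()):
        if planned == release and cr not in implemented:
            findings.append(f"{cr} is planned for {release} but not implemented")
    return findings
```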
All of this information facilitated the work of the developer in determining the components he had to include in his deployment request for the next release. But it was also interesting during impact analysis (e.g. after a change request was removed from a release) to find out more information about an individual component or about the dependencies between components.
This ability to do impact analysis when a change request had to be removed from a release was a big deal for the company. One of the dreams of the people involved in this difficult task was the ability to get a complete list of all change requests that depend on this change request. Although it was not actually developed initially, it would not be very difficult to do now that all the necessary information was available in the CMS. The same could be said about including more intelligent consistency checks: it's quite a small development effort for sometimes important insights that could save a lot of time.
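To give an idea of how small that development effort could be, here is a sketch of such a dependency lookup. It makes a simplifying assumption that is mine, not the company's: two change requests are considered coupled when they touch the same component within the release, and the coupling is followed transitively.

```python
from collections import deque


def dependent_change_requests(crs_by_component, removed):
    """
    crs_by_component: {component: set of change request ids implemented
                       in that component for the release}
    Returns every change request transitively coupled to `removed`.
    """
    affected = {removed}
    queue = deque([removed])
    while queue:
        current = queue.popleft()
        for crs in crs_by_component.values():
            if current in crs:
                # Every other change request in this component is coupled.
                for other in crs - affected:
                    affected.add(other)
                    queue.append(other)
    affected.discard(removed)
    return affected
```

With components `web` implementing CR-1 and CR-2, and `db` implementing CR-2 and CR-3, removing CR-1 would flag both CR-2 and CR-3 for review.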
Finally, the deployment scripts of all technologies were adapted in such a way that they always retrieved the deployable files from the DML, thereby completing the last step of a fully controlled software delivery pipeline from version control tool to production.
Here is an overview of how the CMS and DML were integrated with the existing tools:
You may have noticed that I have emphasized throughout this post the importance of keeping track of the dependencies between the change requests and the components so we are able to do proper impact analysis when the need arises. These dependencies are the cause of many problems within the software delivery process and therefore a lot of effort has to be put into finding ways to remove or at least limit their negative impact. The solution I mentioned here was all about visualizing the dependencies which is only the first step of the solution.
A much better strategy would be to avoid these dependencies in the first place. And the best way to do this is by simply decreasing the size of the releases, which comes down to increasing the frequency of the releases. When releases happen infrequently, the change requests typically stack up to a large pile and in the end all change requests are dependent on one another due to the technical dependencies that exist through their components. You remove one and the whole house of cards collapses. But increasing the release frequency requires decent automation and this is exactly what we're working on! Until the whole flow is automated and the release frequency can be increased we have to live with this problem.
If we take this strategy of increasing the release frequency to its extremes we will end up with continuous delivery, where each commit to each component triggers a new mini-release, one that contains a one-step deployment, that of the component that was committed. No more dependencies, no more impact analysis, no more problems! Nice and easy! Nice? Yes! Easy? Maybe not so. Because this approach doesn't come for free.
Let me explain.
With the batch-style releases at least you could assume that whenever the new version of your component is deployed into an environment it will find the new versions of all components it depends on (remember that most of our features required changes to multiple components). It doesn't have to take into account that old versions of these components may still hang around. With continuous delivery, this assumption is not guaranteed anymore in my opinion. It's now up to the developer to make sure that his component supports both the old and the new functionality and that he includes a feature flag to activate the feature only when all components are released. In some organizations (and I'm thinking about the large traditional ones with lots of integrated applications) this may be a high price to pay.
Time now to head over to the second step of the solution: bringing release management under control.
A need for orchestration
As already mentioned, the applications in our company were highly integrated with each other and therefore most of the new features required modifications in multiple applications, each of which managed by its own dedicated development team.
This seemingly innocent fact had a rather big consequence: it resulted in the need for a release process that is capable of deploying the modified components of each of these impacted applications in the same go, in other words an orchestrated release process. If the components were deployed on different days, it would break or at least disrupt the applications.
In fact, even if a feature can be implemented by only modifying components within one application at least some degree of orchestration is necessary (a good example is a modification in the source code and one in the database schema). But as this would stay within the boundaries of one development team it would be relatively easy to organize that all components are deployed simultaneously.
Things become way more complicated when more than one application and development team is involved in implementing a feature and moreover when multiple of these application-overlapping features are being worked on in parallel and thereby modifying the same components. Such a situation really shouts for a company-wide release calendar that defines the exact dates when components can be deployed in each environment. If we deploy all components at the same moment we are sure that all the dependencies between them are taken into account.
Unfortunately we all know that creating strict deadlines in a context so complex and so difficult to estimate as software development can cause a lot of stress. It also causes a lot of missed deadlines, resulting in half-finished features that must somehow be taken out of the ongoing release and postponed to the next release, along with many of the dependent features. Or features can stay in but were rushed so much to get them finished before the deadline that they lack the quality that is expected from them, causing system outages, urgent bug fixes that must be released shortly after the initial release, loss of reputation, and so on downstream in the process. And even if all goes according to the plan, there is generally a lot of unfinished work waiting somewhere in the pipeline for one or another deployment (development-done but not release-done, also called WIP or work in progress).
Continuous deployment takes another approach to avoid all these problems that are caused by these dependencies: it requires the developers to support backward compatibility. In other words, their component should work with the original versions as well as with the modified versions (those that include the new feature) of the other components, simply because they don't know in advance in which order the components will be deployed. In such a context the developers can do their work at their own pace and deliver their components whenever they are ready. As soon as all the new versions of the components are released a feature flag can be switched on to activate the feature.
There is unavoidably a huge drawback to this scenario: in order to support this backward compatibility the developers basically have to keep two versions of the logic in their code, and this applies to each individual feature that is being worked on. This requirement can be a big deal, especially if the code is not well-organized. Once the feature is activated they should also not forget to remove the old version, otherwise the code base becomes a total mess after a while. If there are changes to the database schema (or other stateful resources) and/or data migrations involved things will become even more complicated. Continuous deployment is also tightly coupled to hot deployment, which introduces quite some challenges of its own, and if the company doesn't have a business need to be up all the time that's a bit of a wasted effort. I found a nice explanation of all the intricacies of such practices in this talk by Paul Biggar at MountainWest RubyConf 2013.
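To make the cost of keeping two versions of the logic concrete, here is a minimal sketch of a component that supports both the old and the new behaviour behind a feature flag. The flag store, the function names and the voucher feature are invented for illustration.

```python
# Hypothetical flag store; in practice this would be external
# configuration that can be switched without a redeployment.
FEATURE_FLAGS = {"vouchers": False}

SHIPPING_FEE_CENTS = 500


def total_legacy(subtotal_cents):
    # Old logic, still needed while other components are not yet updated.
    return subtotal_cents + SHIPPING_FEE_CENTS


def total_new(subtotal_cents, voucher_cents):
    # New logic that understands the additional voucher field.
    return max(subtotal_cents - voucher_cents, 0) + SHIPPING_FEE_CENTS


def total(subtotal_cents, voucher_cents=None):
    # Use the new logic only when the flag is on AND the caller already
    # sends the new field; older callers keep the old behaviour.
    if FEATURE_FLAGS["vouchers"] and voucher_cents is not None:
        return total_new(subtotal_cents, voucher_cents)
    return total_legacy(subtotal_cents)
```

Once every caller sends the new field and the flag has been on for a while, `total_legacy` and the flag check are exactly the code that must be cleaned up.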
But don't get me wrong on this, continuous deployment is great and I really see it as the way to go but that doesn't take away that it will take a long time for the "conventional" development community to switch to a mindset that is so extremely different from what everyone has been used to for so long. For this community (the 99%-ers ;-)) continuous deployment merely appears as a small dot far away on the horizon, practiced by whizkids and Einsteins living on a different planet. But hopefully, if we can gradually bring our conventional software delivery process under control, automating where possible and gradually increasing the release frequency, maybe one day the leap towards continuous deployment may be less daunting than it is today and instead just be the next incremental step in the process.
But until then I'm afraid we are stuck with our orchestrated release process so we better make sure that we bring the processes and dataflows under control so we can keep all the aforementioned problems it brings with it to a minimum.
Let us have a closer look at the orchestrated release process in our company, starting from the moment the components are delivered to the software repository (typically done by a continuous integration tool) to the moment they are released to production.
The first step was for the dev team to create a deployment request for their application. This request should contain all the components - including the correct version numbers - that implement the features that were planned for the ongoing release.
Each team would then send their deployment request to the release coordinator (a role on enterprise level) for the deployment of their application in the first "orchestrated" environment, in our case the UAT environment. Note that we also had an integration environment between development and UAT where cross-application testing - amongst other types of testing - happened but this environment was still in the hands of the development teams in terms of deciding when to install their components.
The release coordinator would review the contents of the deployment requests and verify that the associated features were signed-off in the previous testing stage, e.g. system testing. Finally he would assemble all deployment requests into a global release plan and include some global pre-steps (like taking a backup of all databases and bringing all services down) and post-steps (like restarting all services and adding an end-to-end smoke-test task). On the planned day of the UAT deployment, he would coordinate the deployments of the requests with the assigned ops teams and inform the dev teams as soon as the UAT environment was back up and running.
When a bug was found during the UAT testing, the developer would fix it and send in an updated deployment request, one where the fixed components got an updated version number and were highlighted for redeployment.
The deployment request for production would then simply be a merge of the initial deployment request and all "bugfix" deployment requests, each time keeping the last deployed version of a component. Except if it's a stateful component like a database, in which case deployments are typically incremental and as a result all correction deployments that happened in UAT must be replayed in production. Stateless components like the ones that are built from source code are typically deployed by completely overwriting the previous version.
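The merge rule described above can be sketched as follows. The data shapes are assumptions: each deployment request is represented as a list of (component, version) pairs, and the caller states which components are stateful.

```python
def merge_for_production(requests, stateful):
    """
    requests: deployment requests in the order they were deployed to UAT,
              each a list of (component, version) tuples.
    stateful: set of component names that are deployed incrementally
              (e.g. databases).
    Returns (incremental, latest): the stateful deployments to replay in
    order, and the last deployed version of each stateless component.
    """
    incremental = []
    latest = {}
    for request in requests:
        for component, version in request:
            if component in stateful:
                # Every incremental deployment must be replayed in order.
                incremental.append((component, version))
            else:
                # Stateless components are overwritten: keep the last one.
                latest[component] = version
    return incremental, latest
```

For an initial request with `web 1.0` and `db 1.0`, followed by bugfix requests for `web 1.1` and then `db 1.1` plus `web 1.2`, the production request would replay `db 1.0` and `db 1.1` and deploy only `web 1.2`.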
Again, the release coordinator would review and coordinate the deployment requests for production, similar to how it was done for UAT and that would finally deliver the features into production, after a long stressful period.
Note that I only touched the positive path here. The complete process was a lot more complex and included scenarios like what to do in case a deployment failed, how to treat rollbacks, how to support hotfix releases that happen while a regular release is ongoing, etc.
For more information on this topic you should definitely check out Eric Minick's webinar on introducing uRelease in which he does a great job explaining some of the common release orchestration patterns that exist in traditional enterprises.
As long as there were relatively few dev and ops teams and all were mostly co-located this process could still be managed by a combination of Excel, Word, e-mail, and a lot of plain simple human communication. However, as the IT department grew over time and became more spread out over different locations, this "artisanal" approach hit its limits and a more "industrial" solution was needed.
On the tooling side we chose a proprietary tool that helped us industrialize our release management process.
First of all, it allows the release manager (typically a more senior role than the release coordinator) to configure her releases, which includes specifying the applicable deployment dates for each environment. It also allows the developers to create a deployment request for a particular application and release, which contains a set of deployment steps, one for each component that must be deployed. These steps can be configured to run sequentially or in parallel, and either manually (as in our case, thanks to the Security team for not allowing access to the production servers ;-)) or in an automated fashion.
On the deployment date the deployment requests for a particular release can then be grouped into a release plan - typically after all deployment requests are received - which makes it possible to add release-specific handling such as pre- and post-deployment steps.
And finally during the day of the deployment the release plan will be executed, going over each deployment step of each deployment request either sequentially or in parallel. For each manual step, the ops team responsible for the deployment of the associated component will receive a notification and will be able to indicate success or failure. For each automated step an associated script is run that will take care of the deployment.
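The execution of a release plan can be pictured with the small sketch below. It is not how the proprietary tool works internally; it only covers the sequential case, and stopping at the first failed step is an assumption of mine. The two callables stand in for running a deployment script and for an ops team reporting success or failure.

```python
def execute_release_plan(steps, run_script, confirm_manual):
    """
    steps: list of dicts with keys "component" and "mode", where mode is
           "manual" or "automated".
    run_script: callable(component) -> bool, runs the deployment script.
    confirm_manual: callable(component) -> bool, notifies the ops team
                    and returns the outcome they report.
    """
    results = []
    for step in steps:
        component = step["component"]
        if step["mode"] == "automated":
            succeeded = run_script(component)
        else:
            succeeded = confirm_manual(component)
        results.append((component, "success" if succeeded else "failure"))
        if not succeeded:
            break  # assumed policy: stop so a human can take over
    return results
```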
Here is a mockup of what a deployment request looks like:
As part of the implementation project in our company the tool was integrated with two other already existing tools: the change management tool and the software repository.
The release management tool was notified by the change management tool whenever a change request (or feature) was updated. This allowed the tool to show the change requests that applied to that particular application and release directly on the deployment request which made it possible for the release coordinator to easily track the statuses of the change requests and for example reject the deployment requests that contain not yet signed-off change requests.
The release management tool was also notified by the software repository whenever a new version of a component was built which allowed the tool to restrict the choices of the components and their version number to only those that actually exist in the software repository.
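Taken together, the two integrations allow a deployment request to be validated before it is accepted. The following sketch shows the idea; the status value "signed-off" and the data shapes are assumptions, since the actual tools are not described in that detail.

```python
def validate_deployment_request(steps, known_versions, change_requests):
    """
    steps: list of (component, version) tuples in the deployment request.
    known_versions: set of (component, version) present in the software
                    repository.
    change_requests: {change request id: status} for this application
                     and release.
    Returns a list of problems; an empty list means the request is valid.
    """
    problems = []
    for component, version in steps:
        if (component, version) not in known_versions:
            problems.append(
                f"{component} {version} is not in the software repository"
            )
    for identifier, status in sorted(change_requests.items()):
        if status != "signed-off":
            problems.append(f"{identifier} is not signed off (status: {status})")
    return problems
```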
See here an overview of the integration of the release management tool with the other tools:
More generally, by implementing a tool for release management rather than relying on manual efforts it became possible to increase the quality of the information - this was done by either enforcing that correct input data is introduced or by validating it a posteriori through reporting - and to provide a better visibility on the progress of the releases.
To conclude, let us now take a step back and see which problems that were identified initially got solved by implementing a configuration management tool and a release management tool:
- The exponentially increasing need for release coordination, which was characterized by its manual nature, and the pressure it exerted on the deployment window => SOLVED
- The inconsistent, vague, incomplete and/or erroneous deployment instructions => SOLVED
- The problems surrounding configuration management: not being enough under control and allowing for a too permissive structure => SOLVED
- The manual nature of testing happening at the end of a long process meaning that it must absorb all upstream planning issues - this eventually causes the removal of any not signed-off change requests => NOT SOLVED YET
- The desire by the developers to sneak in late features into the release and thereby bypassing the validations => SOLVED
Looks good doesn't it? Of course this doesn't mean that what was implemented is perfect and doesn't need further improvement. But for a first step it solved quite a number of urgent and important problems. It is time now for these tools to settle down and to put them under the scrutiny of continuous improvement before heading to the next level.
About the Author
Niek Bartholomeus has been a devops evangelist in a large financial institution for the last five years where he was responsible for bringing together the dev and ops teams, on a cultural as well as a tooling level. He has a background as a software architect and developer and is fascinated by finding the big picture out of the smaller pieces. He can be found on Twitter @niekbartho and on his blog.