Virtual Panel on Immutable Infrastructure
“Immutable Infrastructure” is a term that has been increasingly talked about lately in the Ops community. InfoQ reached out to experienced ops engineers to ask them how they define immutable infrastructure and where its borders lie, as well as what its benefits and drawbacks are, in particular when compared to today's widespread “desired state” configuration management solutions. Is it a step forward or backwards in effective infrastructure management?
- Chad Fowler - CTO at Wunderlist
- Mark Burgess - CTO and founder of CFEngine
- Mitchell Hashimoto - CEO of HashiCorp, Creator of Vagrant, Packer, Serf
- Can you briefly explain your definition of immutable infrastructure?
- Can immutable infrastructure help prevent systems divergence while still coping with the need for regular configuration updates?
- Is immutable infrastructure a better way of handling infrastructure than desired state convergence? If yes what are its main advantages? If not what are its main disadvantages?
- What are the borders for immutable infrastructure, i.e. for which kind of changes would it make sense to replace a server vs updating its configuration/state?
- Is it possible to treat every server of a given infrastructure as immutable or are there some kind of server roles that cannot be treated that way?
- What are the main conceptual differences between tooling for immutable infrastructure and desired state convergence?
- Is the choice and maturity of today's tools and the immutable infrastructure pattern itself enough to become mainstream?
- Is immutable infrastructure applicable (and how) for organizations running their services on physical infrastructure (typically for performance reasons)?
- From an application's architecture and implementation perspective what factors need to be taken into account when targeting an immutable infrastructure?
- Does immutable infrastructure promote weak root cause analysis as it becomes so easy to trash a broken server to repair service instead of fixing it? If so is that a problem or just a new modus operandi?
InfoQ: Can you briefly explain your definition of immutable infrastructure?
Chad: I wrote about this on my blog several months ago in detail, but to me the high level essence of immutable infrastructure shares the same qualities that immutable data structures in functional programming have. Infrastructure components, like in-memory data structures, are running components that can be accessed concurrently. So the same problems of shared state exist.
In my definition of immutable infrastructure, servers (or whatever) are deployed once and not changed. If they are changed for some reason, they are marked for garbage collection. Software is never upgraded on an existing server. Instead, the server is replaced with a new functionally equivalent server.
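As a rough illustration of the replace-rather-than-modify model Chad describes, here is a minimal Python sketch. The `Server` class, the `launch` and `upgrade` helpers, and the image names are hypothetical and are not taken from the panel or from any real tool; the sketch only mirrors the analogy with immutable data structures.

```python
from dataclasses import dataclass, FrozenInstanceError
from itertools import count
from typing import List, Tuple

_next_id = count(1)  # hypothetical id source, for illustration only


@dataclass(frozen=True)
class Server:
    """A deployed server. frozen=True makes field assignment raise an error."""
    server_id: int
    image: str  # hypothetical image name such as "web-v1"


def launch(image: str) -> Server:
    """Create a brand-new server from an image."""
    return Server(server_id=next(_next_id), image=image)


def upgrade(fleet: List[Server], new_image: str) -> Tuple[List[Server], List[Server]]:
    """Return (new_fleet, garbage): each server is replaced, none is modified."""
    new_fleet = [launch(new_image) for _ in fleet]
    return new_fleet, list(fleet)


old_fleet = [launch("web-v1"), launch("web-v1")]
new_fleet, garbage = upgrade(old_fleet, "web-v2")

try:
    old_fleet[0].image = "web-v2"  # an in-place upgrade is rejected
except FrozenInstanceError:
    print("in-place change refused; replace the server instead")
```

The old servers end up in `garbage` untouched, which is the "marked for garbage collection" step in this simplified model.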
Mitchell: Immutable infrastructure is treating the various components of your infrastructure as immutable, or unchangeable. Rather than changing any component, you create a new one with the changes and remove the old one. As such, you can be certain (or more confident) that if your infrastructure is already functioning, a change to config management for example won’t break what’s already not broken.
Immutable infrastructure is also sometimes referred to as “phoenix servers,” but I find that term to be less general, since immutability can also apply at the service-level, rather than just the server-level.
Mark: The term "immutable" is really a misnomer (if infrastructure were really immutable, it would not be much use to anyone: it would be frozen and nothing would happen). What the term tries to capture is the idea that one should try to pre-determine as many of the details as possible of a server configuration at the disk-image stage, so that "no configuration is needed". Then the idea supposes that this will make it fast and reliable to spin up machines. When something goes wrong, you just dispose of the machine and rebuild a new one from scratch. That is my understanding of what people are trying to say with this phrase.
I believe there are a number of things wrong with this argument though. The idea of what it means to fix the configuration is left very unclear and this makes the proposal unnecessarily contentious. First of all, pre-determining *everything* about a host is not possible. Proper configuration management deals with dynamic as well as static host state. Take the IP address and networking config, for instance: this has to be set after the machine is spun up. What about executing new programs on demand? What if programs crash, or fail to start? Where do we draw the line between what can and can't be done after the machine has been spun up? Can we add a package? Should we allow changes from DHCP but not from CFEngine or Puppet? Why? In my view, this is all change. So no machine can be immutable in any meaningful sense of the word.
The real issue we should focus on is: what behaviours do we want hosts to exhibit on a continuous basis. Or, in my language, what promises should the infrastructure be able to keep?
InfoQ: Can immutable infrastructure help prevent systems divergence while still coping with the need for regular configuration updates?
Chad: Absolutely. It's just a different model for "updates". Rather than update an existing system, you replace it. Ultimately I think this is a question of granularity of what you call a "component". 15 years ago, if I wanted to update a software component on a UNIX system, I upgraded the software package and its dependencies. Now I tend to view running server instances as components. If you need to upgrade the OS or some package on the system, just replace the server with one that's updated. If you need to upgrade your own application code, create new server instances with the new code and replace the old servers with it.
Mitchell: Actually, immutable infrastructure makes things a bit worse for divergence if you don’t practice it properly. With mutable infrastructure, the idea is that configuration management constantly runs (on some interval) to keep the system in a convergent state. With immutable infrastructure, you run the risk of deploying different versions of immutable pieces, resulting in a highly divergent environment.
This mistake, however, is the result of not properly embracing or adopting immutable infrastructure. With immutable infrastructure, you should never be afraid of destroying a component, so when a new version is available, you should be able to eventually replace every component. Therefore, your infrastructure is highly convergent. However, this is mostly a discipline and process thing, which is sometimes difficult to enforce in an organization.
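The divergence risk Mitchell mentions can be made visible with a simple census of which image each component runs. The following Python sketch assumes a hypothetical inventory (a dictionary from server name to image name); neither the names nor the helpers come from the panel.

```python
from collections import Counter


def image_census(fleet: dict) -> Counter:
    """Count how many servers run each image version."""
    return Counter(fleet.values())


def stale_servers(fleet: dict, target_image: str) -> list:
    """Names of servers that still need to be replaced to reach target_image."""
    return sorted(name for name, image in fleet.items() if image != target_image)


# Hypothetical inventory: server name -> image it was launched from.
fleet = {"web-1": "web-v7", "web-2": "web-v7", "web-3": "web-v5"}

print(image_census(fleet))
print(stale_servers(fleet, "web-v7"))
```

A fleet is convergent in this sense when `stale_servers` returns an empty list for the current target image; anything it does return is a candidate for replacement rather than repair.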
Mark: I don't believe immutable infrastructure helps prevent systems divergence. Trying to freeze configuration up front leads to a "microwave dinner" mentality. Just throw your pre-baked package in the oven and suffer through it. It might be ok for some people, but then you have two problems: either you can't get exactly what you need, or you have a new problem of making and managing sufficient variety of prepackaged stuff. The latter is a harder problem to solve than just using fast model-based configuration management because it's much harder to see into pre-packaged images or "microwave dinners" to see what's in them. So you'd better get your packaging exactly right.
Moreover, what happens if there is something slightly wrong? Do you really want to go back to the factory and repackage everything just to follow the dream of the microwave meal? It is false that pre-packaging is the only way to achieve consistency. Configuration tools have proven that. You don't need to destroy the entire machine to make small repeatable changes cheaply. Would you buy a new car because you have a flat tyre, or because it runs out of fuel?
What prevents divergence of systems is having a clear model of the outcome you intend - not the way you package the starting point from which you diverge.
InfoQ: Is immutable infrastructure a better way of handling infrastructure than desired state convergence? If yes what are its main advantages? If not what are its main disadvantages?
Chad: I think so. Of course, there are tradeoffs involved and you have to weigh the options in every scenario, but I think immutable infrastructure is a better default answer than desired state convergence.
Immutable servers are easier to reason about. They hold up better in the face of concurrency. They are easier to audit. They are easier to reproduce, since the initial state is maintained.
Mitchell: It has its benefits and it has its downsides. Overall, I believe it to be a stronger choice and the right way forward, but it is important to understand it is no silver bullet, and it will introduce problems you didn’t have before (while fixing others).
The advantages are deployment speed, running stability, development testability, versioning, and the ability to roll back.
With immutable, because everything is “pre-compiled” into an image, deployment is extremely fast. You launch a server, and it is running. There may be some basic configuration that happens afterwards but the slow parts are done: compiling software, installing packages, etc.
And because everything is immutable, once something is running, you can be confident that an external force won’t be as likely to affect stability. For example, a broken configuration management run cannot accidentally corrupt configuration.
Immutable infrastructure is incredibly easy to test, and the test results accurately reflect what will actually happen at runtime. An analogy I like to make is that immutable infrastructure is to configuration management what a compiled application is to source code. You can unit test your source code, but when you go to compile it, there is no guarantee that some library versions didn’t change that could ruin your build. Likewise, with configuration management, you can run it over and over, but you can’t guarantee that if it succeeds now it’ll still succeed months down the road. But with a compiled application, or a pre-built server, all the dependencies are already satisfied and baked in; the surface area of problems that can happen when you go to launch that server is much, much smaller.
Versioning is much simpler and clearer because you can tag a specific image with the configuration management revision that is baked into it, the revision of an application, the versions of all dependencies, etc. With mutable servers, it’s harder to be certain what versions or revisions of what exist on each server.
Finally, you get rollback capability! There are many people who think “rollback is a lie,” and at some point it is. But if you practice gradual incremental changes to your infrastructure, rolling back with immutable infrastructure to a recently previous version is cheap and easy: you just replace the new servers with servers launched from a previous image. This has definitely saved us some serious problems a few times, and is very hard to achieve with desired state configurations.
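The versioning and rollback points can be sketched together: if every image is tagged with the revisions baked into it, rolling back amounts to launching from the previous entry in the release history. The `Image` and `Releases` classes and the revision strings below are hypothetical, a sketch under those assumptions rather than how any particular tool works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """An image tagged with the revisions that were baked into it."""
    image_id: str
    config_revision: str  # configuration management revision (hypothetical value)
    app_revision: str     # application revision (hypothetical value)


class Releases:
    """Ordered history of deployed images; the last entry is the current one."""

    def __init__(self):
        self._history = []

    def deploy(self, image: Image) -> None:
        self._history.append(image)

    @property
    def current(self) -> Image:
        return self._history[-1]

    def rollback(self) -> Image:
        """Drop the current image and return the previous one to launch from."""
        if len(self._history) < 2:
            raise RuntimeError("no previous image to roll back to")
        self._history.pop()
        return self.current


releases = Releases()
releases.deploy(Image("img-1", "cfg-a", "app-1.0"))
releases.deploy(Image("img-2", "cfg-b", "app-1.1"))
previous = releases.rollback()
print(previous)
```

This only covers the stateless part of a rollback; data written by the newer version is outside the sketch, which is one reason rollback works best with the gradual, incremental changes Mitchell describes.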
Mark: The way to get infrastructure right is to have a continuous knowledge relationship with system specifications (what I call promises in CFEngine language). To be fit for purpose, and to support business continuity, you must know that you can deliver continuous consistency. Changing the starting point cannot be the answer unless every transaction is designed to be carried out in separately built infrastructure. That's a political oversimplification, not a technical one, and it adds overhead.
I would say that the "immutable" paradigm is generally worse than one that balances a planned start-image with documented convergent adaptation. The disadvantage of a fixed image is a lack of transparency and immediacy. Advocates of it would probably argue that, if they know what the disk image version is, they have a better idea of what the state of the machine is. The trouble with that is that system state is not just what you feed in at the start, it also depends on everything that happens to it after it is running. It promotes a naive view of state.
At a certain scale, pre-caching some decision-logic as a fixed image might save you a few seconds in deployment, but you could easily lose those seconds (and a lot more business continuity) by having to redeploy machines instead of adapting and repairing simple issues. If there is something wrong with your car, it gets recalled for a patch; you don't get a new car, else the manufacturers would be out of business.
Caching data can certainly make sense to optimize effort, as part of an economy of scale, but we should not turn this into a false dichotomy by claiming it is the only way. In general an approach based on partial disk images, with configuration management for the "last mile" changes makes much more business sense.
In practice, "immutability" (again poor terminology) means disposability. Disposability emerges often in a society when resources seem plentiful. Eventually resources become less plentiful, and the need to handle the waste returns. At that stage we start to discover that the disposable scheme was actually toxic to the environment (what are the side-effects of all this waste?), and we wonder why we were not thinking ahead. We are currently in an age of apparent plenty, with cloud environments hiding the costs of waste, so that developers don't have to think about them. But I wonder when the margins will shrink to the point where we change our minds.
InfoQ: What are the borders for immutable infrastructure, i.e. for which kind of changes would it make sense to replace a server vs updating its configuration/state?
Chad: If there are borders and some changes are done on an existing server, the server isn't immutable. Idealistically, I don't think we should allow borders. "Immutable" isn't a buzz word. It has meaning. We should either maintain the meaning or stop using the word. An in-between is dangerous and may provide a false sense of security in the perceived benefits of immutability.
That said, systems do need to be quickly fixable, and the methods we're currently using for replacing infrastructure are slower than hot-fixing an existing server. So there needs to be some kind of acceptable hybrid which maintains the benefits of immutability. My current plan for Wunderlist is to implement a hotfix with a self-destruct built in. So if you have to hot-fix a server it gets marked to be replaced automatically. We haven't done it automatically yet, but we've manually done this and it works well. I see this as an ugly optimization rather than a good design approach.
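The "hotfix with a self-destruct" idea can be sketched as follows: applying a hotfix also records a deadline after which the server must be replaced. This is a minimal Python sketch under stated assumptions; the `HotfixedServer` class, the one-hour grace period, and the helper names are hypothetical and do not describe Wunderlist's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HotfixedServer:
    name: str
    replace_after: Optional[float] = None  # timestamp; None means never hot-fixed


def hotfix(server: HotfixedServer,
           apply_fix: Callable[[HotfixedServer], None],
           now: float,
           grace_seconds: float = 3600.0) -> None:
    """Apply the fix in place and mark the server for replacement."""
    apply_fix(server)
    server.replace_after = now + grace_seconds


def due_for_replacement(servers: List[HotfixedServer], now: float) -> List[str]:
    """Names of hot-fixed servers whose grace period has expired."""
    return [s.name for s in servers
            if s.replace_after is not None and now >= s.replace_after]


servers = [HotfixedServer("web-1"), HotfixedServer("web-2")]
hotfix(servers[0], apply_fix=lambda s: None, now=0.0, grace_seconds=60.0)
print(due_for_replacement(servers, now=30.0))
print(due_for_replacement(servers, now=60.0))
```

Passing `now` explicitly keeps the sketch deterministic; a real scheduler would use the clock and would actually launch the replacement.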
Mitchell: The borders of immutable infrastructure for me break down to where you want to be able to change things rapidly: small configuration changes, application deploys, etc.
But wanting to be able to deploy an application on an immutable server doesn’t make that server immutable. Instead, you should think of the immutability of a server like an onion: it has layers. The base layer of the server (the OS and some configuration and packages) is immutable. The application itself is its own immutable component: a hopefully pre-compiled binary being deployed to the server. So while you do perhaps have an arguably mutable component in your server, it itself is another versioned immutable component.
What you don’t want to be doing for application deploys on immutable infrastructure is to be compiling live on an immutable server. The compilation might fail, breaking your application and perhaps the functionality of the server.
Mark: When a particular configuration reaches the end of its useful life, it should probably be replaced. That ought to be a business judgement, not a technical one. The judgement can be made on economic grounds, related to what would be lost and gained by making a change. But be careful of the hidden costs if your processes are not transparent and your applications are mission critical.
Anytime you have to bring down a system for some reason, it could be an opportunity to replace it entirely without unnecessary interruption, as long as you have a sufficient model of the requirements to replace it with minimum impact, and a hands-free approach to automation for creating that environment. Today, it is getting easy to deploy multiple versions in parallel as separate "branches" for some kinds of applications. But we have to remember that the whole world is not in the cloud. Planes, ships, rockets, mobile devices are all fragile to change and mission critical. There are thus still embedded devices that spend much of their time largely offline, or with low rate communications. They cannot be re-imaged and replaced safely or conveniently, but they can be patched and managed by something like CFEngine that doesn't even need a network connection to function.
InfoQ: Is it possible to treat every server of a given infrastructure as immutable or are there some kind of server roles that cannot be treated that way?
Chad: We have what I call "cheats" with immutable infrastructure. Relational databases are a good example. I think it's possible to work with them immutably, but so far it hasn't been worth the effort for us. If we were an infrastructure vendor I would be applying some effort here, but since we're in the business of making applications for our customers, we have been content to outsource more and more to managed services such as Amazon's RDS.
My goal is that our entire infrastructure consists of either pieces we don't manage directly or components that are completely replaceable. We're almost there and so far it's been a very positive experience.
Mitchell: It is possible, but there are roles that are much easier to treat as immutable. Stateless servers are extremely easy to make immutable. Stateful servers such as databases are much trickier, because if you destroy the server you might be destroying the state as well, which is usually unacceptable.
Mark: Mission critical servers running monolithic applications cannot generally be managed in this disposable manner. As I understand it, the principal argument for this pattern of working is one of trust. Some people would rather trust an image than a configuration engine. One would like to allow developers to manage and maintain their own infrastructure requirements increasingly. However, if you force everyone to make changes only through disk images, you are tying their hands with regard to making dynamical changes, such as adjusting the number of parallel instances of a server to handle latency, and tuning other aspects of performance. Disregarding those concerns is a business decision. Developers often don't have the right experience to understand scalability and performance, and certainly not in advance of deployment.
InfoQ: What are the main conceptual differences between tooling for immutable infrastructure and desired state convergence?
Chad: Desired state convergence is in a different realm of complexity to implement. It's a fascinating idea, but at least for my use cases it's outdated. The longer a system lives, the more afraid of it I become. I can't be 100% sure that it is configured the way I want and that it has exactly the right software. Thousands upon thousands of developer hours have gone into solving this problem.
In the world of immutable infrastructure I think of servers as replaceable building blocks, like parts in an automobile. You don't update a part. You just replace it. The system is the sum of its parts. The parts are predictable since they don't change. It's conceptually very simple.
Mitchell: The tooling doesn’t change much! Actually, all tools used for desired state convergence are equally useful for immutable infrastructure. Instead, immutable infrastructure adds a compilation step to servers or applications that didn’t exist before. For example, instead of launching a server and running Chef, you now use Packer as a compilation tool to launch a server, run Chef, and turn it into an image.
One thing that does change is a mindset difference: immutable tools know they can only run once to build an image, whereas desired state convergence tools expect that they can run multiple times to achieve convergence. In practice, this doesn’t cause many problems because you can just run the desired state convergence tool multiple times when building an image. However, the tools built for immutability tend to be much more reliable in achieving their intended purpose the first time.
Mark: If you want to freeze the configuration of a system to a pre-defined image, you have to have all the relevant information about its environment up front, and then you are saying that you won't try to adapt down the line. You will kill a host to repair the smallest headache. It's an overtly discontinuous approach to change, as opposed to one based on preservation and continuity. If you think of biology, it's like tissue, where you can lose a few cells from your skin because there are plenty more to do the job. It can only work if resources are plentiful and redundant.
With desired-state convergence, you can make a system completely predictable with repairs and even simple changes in real time and respond to business problems on the fly, at pretty much any scale, making only minimal interventions. This is like the role of cellular DNA in biology. There are repair processes on-going because there is no redundancy at the intra-cellular level.
Bulk information is easier to manage from a desired state model than from piles of bulk data because it exploits patterns to good advantage. You can easily track changes to the state (for compliance and auditing purposes) because a model defines your standard of measurement over many versions. Imagine compliance auditing like PCI or HIPAA. How do you prove to an auditor that your system is compliant? If you don't have a model with desired outcome, that becomes a process of digging around in files and looking for version strings. It's very costly and time-wasting.
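For contrast with the replacement model, desired-state convergence can be sketched as an idempotent repair step: compare the actual state with a declared model, make only the minimal changes, and report them. This Python sketch is a deliberate simplification and does not reflect how CFEngine, Puppet, or Chef are implemented; the state keys and the `converge` function are hypothetical.

```python
def converge(actual: dict, desired: dict) -> list:
    """Repair `actual` in place so it matches `desired`; return the changes made.

    Keys that the model does not mention are left untouched.
    """
    changes = []
    for key, want in desired.items():
        if actual.get(key) != want:
            changes.append(f"set {key}={want}")
            actual[key] = want
    return changes


# Hypothetical model and host state.
desired = {"ntp": "running", "port": 80}
actual = {"ntp": "stopped", "port": 80, "hostname": "web-1"}

first_run = converge(actual, desired)
second_run = converge(actual, desired)
print(first_run)
print(second_run)
```

The second run makes no changes, which is the convergence property, and the list returned by each run doubles as a simple record of how the host deviated from the model.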
InfoQ: Is the choice and maturity of today's tools and the immutable infrastructure pattern itself enough to become mainstream?
Chad: Probably not. The foundations are getting better and better with both hosted and internal cloud providers and frameworks, but creating a solid immutable architecture is not currently the path of least resistance. I think most of us will move in this direction over time, but it's currently far from mainstream.
Mitchell: Not yet, but they’re leaps and bounds better than they were a year ago, and they’re only going to continue to become more mature and solve the various problems early adopters of immutable infrastructure may be having.
All the tools are definitely mature enough to begin experimenting with and testing for some aspects of your infrastructure, though.
Mark: I don't believe it can become mainstream unless all software becomes written in a completely stateless way, which would then be fragile to communication faults in a distributed world. Even then, I don't think it is desirable. Do we really want to argue that it is better for the whole world to eat microwave dinners, or to force chefs to package things in plastic before we eat them? If history has taught us anything, it is that people crave freedom. We have to understand that disposability is a large scale economic strategy that is just not suitable at all scales.
InfoQ: Is immutable infrastructure applicable (and how) for organizations running their services on physical infrastructure (typically for performance reasons)?
Chad: Sure. While perhaps the servers themselves would run longer and probably require in-place upgrades to run efficiently, with the many options available for virtualization, everything on top of that is fair game. I suppose it would be possible to take the approach further down the stack, but I haven't had to do it and I don't want to speculate.
Mitchell: Yes, but it does require a bit more disciplinary work. The organization needs to have in place some sort of well automated process for disposing of and re-imaging physical machines. Unfortunately, many organizations do not have this, which is somewhat of a prerequisite to making immutable infrastructure very useful.
For example, what you really want is something like Mesos or Borg for physical hardware.
Mark: The immutable infrastructure idea is not tied specifically to virtualization. The same method could be applied to physical infrastructure, but the level of service discontinuity would be larger.
Today, immutability is often being mixed up with arguments for continuous delivery in the cloud, but I believe that disposable computing could easily be contrary to the goals of continuous delivery because it adds additional hoops to jump through to deploy change, and makes the result less transparent.
InfoQ: From an application's architecture and implementation perspective what factors need to be taken into account when targeting an immutable infrastructure?
Chad: Infrastructure and services need to be discoverable. Whatever you're using to register services needs to be programmable via an API. You need to have intelligent monitoring and measuring in place, and your monitoring needs to be focused less on the raw infrastructure and more on the end-purpose of the service than you're probably used to.
Everything needs to be stateless where possible.
As I mentioned previously, we have "cheats" like managed database systems. I'm not going to comment on how you have to change architecture to have your own immutable, disposable database systems since it's thankfully not a problem I've needed or wanted to solve yet.
Mitchell: Nothing has to change in order to target immutable infrastructure, but some parts of architecting and developing an application become much easier. Developers in an immutable infrastructure have to constantly keep in mind that any service they talk to can die at any moment (and hopefully be replaced rather quickly). This mindset alone results in developers generally building much more robust and failure-friendly applications.
With strongly coupled, mutable infrastructures, it isn’t uncommon to interrupt a dependent service of an application, and have that application be completely broken until it is restarted with the dependent service up.
While keeping immutable infrastructure in mind, applications are much more resilient. As an example from our own infrastructure managing Vagrant Cloud, we were able to replace and upgrade every server (our entire infrastructure) without any perceivable downtime and without touching the web frontends during the replacement process. The web applications just retried some connects over time and eventually came back online. The only negative experience was that for some people their requests were queued a bit longer than usual!
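The behaviour described here, where a client keeps retrying while its dependency is being replaced, might look like the following Python sketch. The exponential backoff, the `call_with_retry` helper, and the simulated flaky dependency are assumptions made for illustration; the panel does not say how the Vagrant Cloud frontends implemented their retries.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on ConnectionError wait and retry with exponential backoff.

    Re-raises the last ConnectionError once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated dependency that is unavailable for the first two calls,
# as if its server were being replaced.
calls = {"count": 0}


def flaky_backend():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("backend is being replaced")
    return "ok"


delays = []
result = call_with_retry(flaky_backend, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the example fast and testable; in production the default `time.sleep` would apply, and the visible cost is the extra queueing time Mitchell mentions.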
Mark: The aim should not be to make applications work around an immutable infrastructure. You don't pick the job to fit the tools. The immutable infrastructure is usually motivated as a way of working around the needs of application developers. The key question is: how do you optimize a continuous delivery pipeline?
InfoQ: Does immutable infrastructure promote weak root cause analysis as it becomes so easy to trash a broken server to repair service instead of fixing it? If so is that a problem or just a new modus operandi?
Chad: I don't think so. In the worst case, it doesn't change how we do root cause analysis, because replacing and rebooting are sort of the same last-ditch effort when something is wrong. In the best case, it makes it easier to experiment with possible solutions, tweak variables one at a time, bring up temporary test servers, swap production servers in and out, etc.
I see the point you're hinting at in the question, though. There may be a class of problems that is all but eliminated (read: obscured to the point of essentially not existing) if the average life span of a server is less than one day. It may also be harder to pinpoint these problems if they do occasionally pop up without knowing to try server longevity as a variable.
Maybe that's OK.
Mitchell: Because it is very likely that the server you’re replacing it with will one day see that same issue, immutable infrastructure doesn’t promote any weaker root cause analysis. It may be easier to ignore for a longer period of time, but most engineering organizations will care to fix it properly at some point.
Actually, I would say the root cause analysis becomes much stronger. Since the component is immutable and likely to exhibit the same problems under the same conditions, it is easier to reproduce, identify, fix, and finally deploy your change out across your entire infrastructure.
Additionally, desired state configuration has a high chance of making the problem worse: a scheduled run of the configuration management system may mask the real underlying issue, causing the ops team to spend more time trying to find it or even to just detect it.
Mark: Disposal of causal evidence potentially makes understanding the environment harder. Without model-based configuration a lot of decisions get pushed down into inscrutable scripts and pre-templated files, where the reason for the decisions only lives in some developer's head. That process might work to some extent if the developer is the only one responsible for making it, but it makes reproducibility very challenging. What happens when the person with that knowledge leaves the organization?
In psychology, one knows that humans cannot remember more than a small number of things without assistance. The question is: how do you create a knowledge-oriented framework where intent and outcome are transparent, and quickly reproducible with a minimum of repeated effort? This is what configuration management was designed for and I still believe that it is the best approach to managing large parts of the infrastructure configuration. The key to managing infrastructure is in separating and adapting different behaviours at relevant scales in time and space. I have written a lot about this in my book “In Search of Certainty: The science of our information infrastructure”.
About the Panelists
Chad Fowler is an internationally known software developer, trainer, manager, speaker, and musician. Over the past decade he has worked with some of the world's largest companies and most admired software developers. Chad is CTO of 6Wunderkinder. He is the author or co-author of a number of popular software books, including Rails Recipes and The Passionate Programmer: Creating a Remarkable Career in Software Development.
Mark Burgess is the CTO and Founder of CFEngine, formerly professor of Network and System Administration at Oslo University College, and the principal author of the CFEngine software. He’s the author of numerous books and papers on topics ranging from physics and network and system administration to fiction.
Mitchell Hashimoto is best known as the creator of Vagrant and founder of HashiCorp. He is also an O’Reilly author and professional speaker. He is one of the top GitHub users by followers, activity, and contributions. “Automation obsessed,” Mitchell strives to build elegant, powerful DevOps tools at HashiCorp that automate anything and everything. Mitchell is probably the only person in the world with deep knowledge of most virtualization hypervisors.
Great panel. This is how software will be done in the future.
I'm in favour of creating building blocks and workflows that mean it's quick and easy to build and replace the VMs that we routinely deploy. An example of how future applications can be created/deployed is the unikernel approach, of which Open Mirage (openmirage.org) is the most advanced. Essentially, everything can be managed via version control and the OS is merely a bunch of libraries which provide the desired functionality (e.g. a TCP/IP lib for networking). In general, this seems to follow Chad's view.
We're using Mirage as a core piece of a new toolstack (nymote.org) to enable the creation of resilient applications that lend themselves to the 'immutable infrastructure' approach.
Anatole Tresch Mar 03, 2015