"In Search of Certainty" - Book Review and Interview with Mark Burgess
"In Search of Certainty - The Science of Our Information Infrastructure", written by CFEngine's creator, Mark Burgess, takes us on a fascinating journey through physics and biology, and shows what we can learn from them and apply to the realm of information infrastructure.
The author uses the realms of physics and biology to assert that uncertainty is an inescapable fact of life. He proposes promise theory as the best model to cope with that uncertainty. A world of promises is a world where autonomous agents publish their intended behaviour in the form of promises. Since a promise may not have been verified, there must exist a degree of trust between the agents. This trust can be built on the verification that previous promises were kept.
The set of agents' promises allows for the creation of reasoning networks, or graphs, all based on voluntary commitments. This model of the world is fundamentally different from a command-and-control one, where some central authority orders agents to behave in some way. While the former says that it is not possible to have complete control over the hundreds or thousands of agents of a given system, the latter assumes that it is indeed possible. Among configuration management tools, pull-based ones are closer to the world of promises, while push-based ones are closer to command-and-control. Mark Burgess co-authored a book focused on promise theory, which presents CFEngine 3 as a promise-keeping engine.
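The promise model described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `Promise`, `Agent` and `trust_in` are hypothetical and come neither from the book nor from CFEngine. Each agent publishes promises about its own behaviour, and an observer's trust is taken here, as a simplifying assumption, to be the fraction of promises it has seen kept.

```python
from dataclasses import dataclass, field


@dataclass
class Promise:
    """A voluntary statement of intent made by one agent about itself."""
    promiser: str
    body: str


@dataclass
class Agent:
    name: str
    promises: list = field(default_factory=list)
    # What this agent has observed of others: promiser name -> list of kept flags.
    observations: dict = field(default_factory=dict)

    def promise(self, body):
        """Publish a promise; an agent can only promise its own behaviour."""
        p = Promise(self.name, body)
        self.promises.append(p)
        return p

    def observe(self, promise, kept):
        """Record whether a promise made by another agent was seen to be kept."""
        self.observations.setdefault(promise.promiser, []).append(kept)

    def trust_in(self, other_name):
        """Fraction of observed promises that were kept (None if none observed)."""
        seen = self.observations.get(other_name, [])
        if not seen:
            return None
        return sum(seen) / len(seen)


web = Agent("web01")
client = Agent("client")
p = web.promise("serve HTTP on port 80")
for kept in (True, True, False, True):
    client.observe(p, kept)
print(client.trust_in("web01"))  # 0.75
```

Nothing in this sketch lets one agent command another: `client` can only record what it observes and adjust its trust, which is the contrast with command-and-control drawn above.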
Mark Burgess believes that command-and-control models cannot cope with the scale, complexity and reasoning demands of modern infrastructures:
We suffer sometimes from the hubris of believing that control is a matter of applying sufficient force, or a sufficiently detailed set of instructions.
Computer Immunology is described in detail in the book. Mark Burgess introduced the concept in an influential paper in 1998. He argues that biological and social systems of comparable or greater complexity exhibit self-healing capabilities that are core to their survival, and so information systems should also exhibit this behaviour. Its impact can be seen both in CFEngine and in many other configuration management tools. The idea that agents residing on each node continuously try to keep their promises (in the book's terminology), correcting any deviations they might encounter, is a metaphor for an immune system.
How do the agents know what a deviation is? By drawing inspiration from a mathematical concept:
A fixed point is a place where you can end up and remain in spite of the specific perturbations that are acting. It is a self-consistent place within a system. The existence of fixed points in systems is a more or less deciding factor for the existence of a stable solution.
A system's desired configuration state can be said to be defined by fixed points. Most configuration management systems (e.g., CFEngine, Chef, Puppet, PowerShell DSC) are based on this idea: they provide the means to declare what the end state must be, instead of requiring imperative workflows that prescribe what to do.
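The fixed-point idea can be illustrated with a small Python sketch. It is not how any of the tools above are implemented: the `converge` function and the dictionary-based state are assumptions made for this example. The point is that a repair operation applied to an already repaired state changes nothing, so the desired state is a place where the system can "end up and remain".

```python
def converge(state, desired):
    """Return a repaired copy of `state` in which every desired key holds.

    Keys that are not mentioned in `desired` are left alone.
    """
    repaired = dict(state)
    for key, value in desired.items():
        if repaired.get(key) != value:
            repaired[key] = value  # repair only what has deviated
    return repaired


desired = {"httpd": "running", "port": 80}

drifted = {"httpd": "stopped", "port": 80, "motd": "hello"}
once = converge(drifted, desired)
twice = converge(once, desired)

print(once)           # {'httpd': 'running', 'port': 80, 'motd': 'hello'}
print(once == twice)  # True: the repaired state is a fixed point
```

Because repeated application is harmless, an agent can run such a repair on every node at regular intervals without first knowing whether anything has drifted.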
Scale is discussed at length, especially how it induces uncertainty. The perception of being in control is directly related to the scale we focus our attention on and, by implication, to the information that we disregard, whether by conscious decision or not. This effect can be seen on a day-to-day basis. If you are a programmer, you might be concerned with detail at the level of individual lines of code. Your manager will probably be concerned with coarser-grained detail. The tools you and your manager use will be tailored to the scale you're at, so the measurements you make and the information you gain will be different, meaning you can reach widely different conclusions about a project's status. An example from infrastructure also illustrates the concept. Let's say you have a cluster of web servers and you're seeing steady requests-per-second values. By that measurement, everything seems fine, but it does not tell you that your disks are nearly full and that in a few minutes you'll have an outage. That requires measurements on a different scale.
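The web server example can be made concrete with a short Python sketch. All node names, sample values and the 90% threshold are invented for illustration. The same data is read at two scales: a cluster-wide throughput figure that looks healthy, and a per-node disk check that reveals the problem.

```python
from statistics import mean

# Hypothetical per-node samples for a three-node web cluster.
nodes = {
    "web01": {"requests_per_sec": [410, 405, 398], "disk_used_pct": 97},
    "web02": {"requests_per_sec": [402, 399, 407], "disk_used_pct": 55},
    "web03": {"requests_per_sec": [395, 401, 404], "disk_used_pct": 61},
}


def cluster_throughput(nodes):
    """Coarse scale: sum of each node's mean requests per second."""
    return sum(mean(n["requests_per_sec"]) for n in nodes.values())


def nodes_near_full(nodes, threshold_pct=90):
    """Fine scale: names of nodes whose disk usage is at or above the threshold."""
    return [name for name, n in nodes.items()
            if n["disk_used_pct"] >= threshold_pct]


print(cluster_throughput(nodes))  # steady, so the cluster looks fine
print(nodes_near_full(nodes))     # ['web01']
```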
Scale has a big impact on the notions of continuity and discreteness. Ever since quantum theory was formulated, we have known that the world is fundamentally discrete. It only appears continuous, as modeled by classical mechanics, due to the scale at which we observe it. This insight is also relevant to information systems, as it teaches us that it is possible to build the appearance of continuity on top of discrete components. It also informs us that we need to be aware of the scale at which we look at things. For instance, we create clusters of servers to give the illusion of continuity even when one (discrete) node fails. But we have to be aware of that illusion and measure at the relevant scales to keep the systems running.
Quantum theory also teaches us that the act of measurement itself has an impact on the objects that are being measured. Everyone who has seen a monitoring agent consume an inordinate amount of CPU will understand the theory.
The balance between dynamics (how things change) and semantics (what things mean) is an interesting one. As the author states: "Dynamics always trumps semantics". In order for something to have meaning, first it has to happen in a more or less predictable way. But how do we know how it happens? We have to wait for it to happen and we have to measure it at the right scale. Only then can we attach meaning to something. It is not possible to reason about semantics without taking into account the underlying dynamics. In a world of certainty, we wouldn't need to care about dynamics so deeply. For instance, if we are dealing with a global web site, latency (dynamics...) can have a measurable impact. If we use AWS, or other cloud providers, the dynamics of regions have an impact on semantics.
Sharing knowledge is an important part of increasing certainty, especially when hundreds or thousands of nodes are under management. The book argues that topic maps are currently the best knowledge representation. Topic maps represent information through:
* topics - represent any kind of concept
* associations - represent any relation between topics
* occurrences - represent resources relevant to a specific topic
CFEngine uses topic map concepts. They are embedded in the promise descriptions and can be extracted to build a knowledge database.
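The three building blocks of a topic map can be sketched as a small Python data structure. This is a deliberately simplified illustration: the `Topic` and `TopicMap` classes and the sample data are assumptions of this example, not CFEngine's implementation, and features of full topic maps such as association roles are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Any kind of concept, with occurrences pointing at relevant resources."""
    name: str
    occurrences: list = field(default_factory=list)


@dataclass
class TopicMap:
    topics: dict = field(default_factory=dict)
    # Each association is a (topic, relation, topic) triple.
    associations: list = field(default_factory=list)

    def topic(self, name):
        """Return the topic with this name, creating it on first use."""
        return self.topics.setdefault(name, Topic(name))

    def associate(self, a, relation, b):
        """Relate two topics, creating them if they do not exist yet."""
        self.topic(a)
        self.topic(b)
        self.associations.append((a, relation, b))

    def related(self, name):
        """All (relation, other topic) pairs in which `name` takes part."""
        out = []
        for a, rel, b in self.associations:
            if a == name:
                out.append((rel, b))
            elif b == name:
                out.append((rel, a))
        return out


tm = TopicMap()
tm.associate("web servers", "run", "httpd")
tm.associate("httpd", "listens on", "port 80")
tm.topic("httpd").occurrences.append("/etc/httpd/conf/httpd.conf")

print(tm.related("httpd"))
# [('run', 'web servers'), ('listens on', 'port 80')]
```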
This is not an information technology book in the usual sense. In fact, the reader does not need to know anything about information technology, although that background is needed to relate the book's message to the information technology field, in particular to infrastructure management.
The book would have benefited from more rigorous editing, as there are several typos throughout its pages. This minor complaint does not detract from the joy the book gives the reader, though.
InfoQ took the opportunity to interview Mark Burgess on these topics.
InfoQ: The first part of the book explains how scale has such an influence on how we view and understand the world, an influence that we are rarely aware of. Why is scale so important? Do you have any tips on how to raise that awareness in our day-to-day activities?
Mark: First of all, let me thank you for your perceptive review and the invitation to be interviewed. Very nicely written, summarizing key points from the book.
You're right, I focus a lot on scale in the book, because it's something that computer science doesn't teach. It's very much a physics view of the world, but so important. As animals, we are so well adapted to deal with scale in everyday life that we scarcely think about it. But our concepts of reasoning are not so well adapted. I think the reason scale is so important is that it measures how tightly coupled things are in the world. We can only separate what happens at different scales when things are loosely coupled. When there is strong coupling, we experience “chaos” or very complex behaviour. So the ability to distinguish and separate scales is closely allied with our notions of simplicity.
As for how to think of it in daily life, I think it is incumbent on all of us to see the world in terms of what matters and what doesn't in the big picture. Many times we fret over small details that have no impact in the larger scheme of things. Other times we overlook crucial details that have catastrophic consequences. If we understood more about where weak and strong couplings occur, we would navigate these things better. In the book I talk about two aspects of the world: dynamics (what actually happens) and semantics (what it means to us). Semantics are something everyone decides for themselves - it is a matter of awareness and voluntary choice, so in the same way everyone can decide to see how scales affect the big picture or not. But it all rests on the dynamics of scales. Having thought about this a lot, with a background in physics, I find it hard to not make these judgements all the time. It really is something that speaks to me from the mundane to the exotic.
InfoQ: CFEngine 3 was heavily influenced by the concept of promises. In what ways can we identify those influences in CFEngine?
Mark: That's right. I actually spent five years thinking about how I could formulate a theory that would help me to rework CFEngine. CFEngine 2 was not based on promise theory, but it sort of captured it intuitively in some aspects. It had great success and then started to reach its limits as a tool in the mid 2000s. I wanted to see what I could learn from the successes and see how to understand the problems clearly for version 3, which later came out in 2009. The most obvious place you can see the developed theory in CFEngine 3 is in the language itself. Absolutely everything that you express is a promise, or part of a promise, made by some object, or group of objects. Each promise is continuously measured - is it kept or not kept? How it is repaired, and so on. But there are other areas too where promise theory is key. One is in the complete decentralisation of decision-making in CFEngine - quite unlike other configuration tools, but more like network routing protocols. Although most users typically centralise the design of their system, all decision-making and information is made by each CFEngine agent autonomously. So there is no strong coupling through the network. That makes it very scalable and resilient. The promise model makes it very easy to track the knowledge about the system too, because promises meld knowledge and intent in an easy concept - we can find potential conflicts easily too. It's a surprisingly powerful model from a very simple idea.
For me it's important because CFEngine is not just a build system, like some configuration tools. It also manages the run-time state, over time. It engages almost symbiotically with the running system. Can you promise that a web server will be running, or that it will be restarted within a couple of minutes if it should crash? Well, yes, you can. These are still important issues, even with the current attitude shift towards disposable computing.
InfoQ: Do you find any other configuration management tools that use promises, or that are at least compatible with its principles?
Mark: Actually the networking world has had a simpler time understanding the value of promises, I would say, because networking has pretty much always been designed in a promise-compatible way. But that said, centralised control is always the first idea people come back to when they need to manage something. It's like it's hardwired into our culture.
I am actually surprised and a little humbled by how much of promise theory has been taken on board by the industry. The principles are easy to understand, even though the execution of applying it is sometimes harder. Several of the network vendors are advertising products based on it. Of course, it is a theory and theories describe good and bad, so all tools are compatible with its principles. But Promise Theory can say whether a tool uses a simple model or a complicated one - and CFEngine is the simplest promise-compatible model, I would say. I see a great interest in how Promise Theory represents what is simple. It's quite different from the usual story about programming.
InfoQ: Is the promise theory - and the autonomy for each agent that it entails - needed even for small infrastructure settings or is there a threshold below which a command-and-control style is better?
Mark: I don't believe that there is a magic scale at which you need to follow sound principles. You should start from the beginning, because the only thing you know with certainty is that your needs will either grow or stop altogether! You might as well start in a sustainable way.
InfoQ: Why is it so difficult to let go of the idea that we have complete control and certainty or, on the other hand, to convince ourselves that it is possible to leave a state of permanent fire-fighting? How can we change course?
Mark: I think forty years of remote controls in front of the TV have spoilt us into thinking we can just push a button and get what we want! (Laughs) We are just spoilt. I bet farmers and sailors and even pilots have more respect for the uncertainties we don't control than IT people. We are taught that machines just do what we tell them - after all, we are smart and they are dumb. Unfortunately, it's often the other way around. One of the smartest systems I know, that has absolutely no pretensions of intelligence, is the immune system in vertebrates. It performs complex reasoning to diagnose and cure us from illnesses that it is actually impossible to know anything about, and all without any recognisable intelligence - just simple pattern matching. The immune system can mount a defence against artificial antigens that have never existed ever. That is pretty smart, but we are often too dumb to realize.
Our human culture is based very much on simple linear story telling. Our logic and our story telling are not separate things. They are the same. If you can't tell a story about something, you don't have a reason for it. We build this into our notion of computer programming too. Imperative languages are just about linear story-telling. But there are other ways to tell a story too, by starting at the end-state. Stories emerge from subtle processes, like bee-hives and so-called swarm intelligence. Programmers are not taught to think that way, so we cling to an unsound belief in causation or determinism. It is actually harming the industry now. Luckily, I know a lot of physicists and biologists who work in IT and they are often behind the large installations.
InfoQ: The book also highlights the importance of spreading knowledge and how tools must help in this regard. What is the state of the art in the configuration management tools space? What is your vision for the future or are we already there?
Mark: Thank you for mentioning that. It is something I believe passionately in. Knowledge is the way we scale intent and it is also the result of rehearsal. So it is intimately connected with certainty. I believe that configuration management is just one small part of what we should think of as knowledge management — or seeking certainty in systems.
It refers to both humans and machines. Knowledge is the bridge between human and machine worlds. I think that's pretty important. Some argue that only automation is the answer. Some argue that only humans are the answer. The answer is always that both need to find their role. I've actually argued that there will be more use for arts majors and teachers (people with pedagogical skills) in the future of system management, as we cannot have knowledge without culture and humanism.
In terms of other tools, I haven't seen an explicit interest in the knowledge aspect. Most people complain that they don't know what I'm talking about, when I say the words, though I've seen some interesting beginnings of ideas very similar to my own from a couple of different vendors recently. Some of them acknowledge my writings as inspiration. I explain some of my vision in the book, but we are a long way from realising it. I believe it also encompasses what the DevOps people call CAMS (Culture, Automation, Monitoring and Sharing) too. More than that I don't want to reveal now.
InfoQ: Do you have any thoughts on container technologies like Docker, and how do they fit with promise theory and computer immunology? Container technologies seem to favour an immutable approach.
Mark: I have been waiting for these things to come to fruition for ten years actually. I was always a little bit puzzled by full machine virtualisation. At CFEngine we've been working with Solaris zones for five years or thereabouts, but it is good to see them in Linux too. I don't think Promise Theory says anything special about containers. Of course they are autonomous agents in the sense of Promise Theory, but so is a single process, or even a file. This nonsense about immutability is a complete red herring, in my view. I don't understand why people want to use such extreme terminology. It means what they say ends up being neither strictly true, nor capturing what they really mean. I call that politics, not science. There are plenty of scientific questions around process encapsulation and even efficiency of configuration, but there is no reason to connect this to a fad of abusing the notion of immutability.
One thing I find interesting is that, when I wrote Computer Immunology back in '98, I mentioned that biology can play a numbers game to maintain resilience. By that I mean, you can lose a few skin cells without losing a leg. Biology's redundancy is through large numbers of similar things. Back then it was not realistic for computers to work in that way, but in the interim we've arrived at the kind of scale where that strategy is starting to make sense. This is what I call disposable computing. Throw away a broken process rather than trying to fix it. Machines can be made expendable as long as the total software is designed for it. Not much of it is today, but we're getting there. Nature shows that this is a good way of scaling services. I am currently writing more about scalability, and the idea of a software wind-tunnel on this subject now.
InfoQ: Mainframes are known for the robustness and certainty they provide, while being one big monolith and usually operated in a command-and-control model. Aren't they a counter-example to the arguments presented in the book?
Mark: Hah! No, I wouldn't say that. Mainframes are an example of quality and quantity through sheer brute force. But they are not a scalable solution. The lesson of scalability is that a high level result can come from even junk at the bottom layers. The properties of upper layers do not depend strongly on what is underneath, when you achieve a weakly coupled design, and scale separation. This is what we've spent half a century reinventing around the current datacenter model. The mainframe was a Swiss watch that went like the clappers, but it couldn't scale. Now we have cheap imported watches that plug together, and they are starting to scale very well. Scale is a subtle adversary, a real brain teaser. It too is an aspect of knowledge management: you have to know what you are looking for as you zoom in and out of a system. Then you can use it to your advantage.
I feel there's still a long way to go in explaining these ideas to the wider world. My book was aimed at an elite of open minded science lovers - people interested in deepening the scientific culture of our field. But as I mentioned towards the end, we can't rely on reaching just a few curious minds, these ideas need to be socialised for the masses, and built into the technologies we take for granted. It's always a hard sell to change the world. As Machiavelli said: the innovator has for enemies everyone who does well under the old scheme of things, and only the lukewarm support of those who might do well under the new.
About the Book Author
Mark Burgess is the CTO and Founder of CFEngine, formerly professor of Network and System Administration at Oslo University College, and the principal author of the Cfengine software. He’s the author of numerous books and papers on topics from physics, Network and System Administration, to fiction.