
Microservices and the Economics of Small Things


Key Takeaways

  • Truth and reasoning are scale dependent phenomena. As we scale systems, relative weights and designs may change, altering the effective definitions of key decision criteria, like true and false.
  • Modularization is a mixed blessing: by specializing and focusing on semantic distinctions, our attention goes to smaller and smaller issues, and we sacrifice large scale information and predictive patterns.
  • Inappropriate tools for analysing a given scale, no matter how esteemed (statistics, calculus, logic, etc), may mislead us.
  • Modularity may be convenient for division of labour, and help us to reason logically about systems, but it may also compromise availability and robustness of interactions.
  • Computer Science needs a more informed approach to scaling. It should look to the physics of dynamical systems to understand how to identify relevant scales, and when to double down or rise up in response to the challenges of the system.
     

Folding society into infrastructure,
and voting for our pleasures...

 

"Never again will a single story be told as though it's the only one."

--John Berger quoted in the legend of
The God of Small Things, by Arundhati Roy
 

Video of the keynote

This essay is about the process of "decentralizing intent" and the effect it has on the predictability of our systems; it is about the changing face of what we can know (in order to make predictions or extrapolate experiences) as we scale systems. It is (yet another) essay about scaling and the boundaries we mentally draw around parts we want to separate. Finally, it is a sequel to my keynote on Microservices.... It is about sharing and mixing, about autonomy and separation of concerns. Mainly, it's about our inability to reason at scale. There are three main parts and a section at the end:

  1. Foreword
  2. Scaling causality by information
  3. Implications of scaling: From microservices to macroeconomics
  4. Endnote: The economics of relative things

1. Foreword

We shape our lives around small concerns, but (thanks to the Internet) the scale of our involvement in the world is ever growing. Take a look at the following list of trends:
 

big data → microservices,
macroeconomics → microcurrencies,
abstraction → specialization,
employees → contractors,
shared platform → standalone.
 

Dichotomies of scale, like these, fly back and forth. For the time being, it seems that we are scaling in one direction, from the large to the small, from the collective to the singular, from centralized to decentralized. We are pursuing a separation of concerns. This is, of course, neither anything new, nor is it entirely true. It is something of a mirage, but one that conceals a few lessons. Never before has such a multiplicity of different scales interacted and intruded into the basic mechanisms of society and its technical infrastructure.

In this essay, I want to examine implications and consequences of this. What does our preoccupation with the separation of concerns mean for our ability to make predictions, and how must we alter our expectations about the things we know and control?

I build on a few simple points, which I believe have deep and far-reaching implications, then attempt to apply the points to cases such as monitoring and planning infrastructure, business, and the economy at large. To begin with, let's pose some simple questions of the kind we need to ask in planning:

  • Would we expect to be able to predict how much Alice drinks each week by watching the level of water in the office water cooler?
  • Could a breakfast-cereal company infer the demand for cornflakes by monitoring the shopping lists of a hundred selected customers?
  • If Alice counts the number of plastic cups she uses herself, can she predict total water usage for the office?
  • Can we predict the fluctuations in temperature of Bob's refrigerator based on the local weather forecast?
  • Can we tell if our software application is using more CPU than it should by monitoring the process list?
  • Do we know our real speed from the speedometer in the car?
  • Can we tell just by looking if we are going up or down a mountain in a thick fog with two-metre visibility? What about 100-metre visibility? Or 1,000 metres?

If we are tempted to try to think of a smart way to divine the answers to these questions, using some algorithm or fancy argument, we would be in (what mathematician John von Neumann referred to as) a state of sin. I chose these questions deliberately because they cannot be answered, as they involve crucially incomplete information. On the surface, they seem like common everyday issues. It seems that we could fill in the missing pieces with a couple of assumptions and take an educated guess. After all, drinking cups of water causes the level of the water cooler to change; intent to buy cornflakes causes purchases, and so on. But what is deliberately misleading is that we cannot know what effect or response will be observed from the causes, because we don't have all the information. There is mixed causality. The intent to cause an effect does not necessarily bring that effect about in full, and an observed effect may stem from a competing cause.

The key to understanding these questions is awareness that they grapple with incomplete information. In some cases, we have deliberately tried to isolate the answer from a wider world that could affect the answer (e.g. by building a refrigerator). In other cases, an assumption has drifted but we have not adapted to the change (as in IT monitoring), and finally we are sometimes guilty of mixing information we are relatively sure about into a pool of noise, rendering it completely uncertain.

Why is this interesting? It is interesting because these questions demonstrate ways in which we fool ourselves on a daily basis, with false narratives about the behaviour and performance of our major systems. It happens by ignorance, and our ignorance is getting worse due to the shifts listed above.

Big is beautiful, small is suave

In the developed world, we are breaking up and going solo. We are retreating from a postwar phase of civilization building and hands-on cooperation into a new narrative of independent flexibility and personal freedoms. This includes drifting away from the adventure of large dreams back to the familiarity of our more immediate concerns. Why? Libertarians might argue that we are rejecting central one-size-fits-all government in favour of individual freedoms, but clearly our lifestyle is more organized than an anarchist's lot: developed countries live privileged lives atop a deep foundation of shared infrastructure. Many of the cooperative functions, which have buoyed civilization by the concerted cooperation of individuals, have been absorbed into shared services, norms, and patterns of society, take-it-for-granted doctrines, and shared utilities that keep us relatively safe and comfortable. We are not really more free of the need to share than before, but we are more free of the need to think about it. We can direct a larger part of our attentions to selfish interests, to the god of small things, just as long as someone maintains the selfless machinery of society for us. This has practical implications. I'll start with the technical reasoning of scaling, and then examine some of the implications for information systems and for economics.

You can see it everywhere. The exploratory conquest of the 18th century gave way to shrink-wrapped mail-order delivery in the 21st. Where we used to talk about family, societies, clubs, and membership, we now talk more about individuality, independence, and autonomy. Where geographical regions once spoke of treaties, nation building, and alliances for sharing, they now speak of regulation, protections, and independence for enlightened self-interest. Collectivism has given way to capitalism, in the cartoon rendition of economic doctrines. Will globalism give way to separatism? These forces are currently playing out.

In technology, we have moved from integrated software systems to independent microservices, from trains and buses to cars and motorcycles. Floating on the surface, shared computers have given way to desktops, laptops, and wearables — "bring your own device" (a computer in your pocket). Below the horizon of our attention is a "cloud" of infrastructure that makes it work, where a society of direct interaction once worked. We can't see it, so we expect to take it for granted. But can it do the job we expect of it?

How to play a big fish in a small pond

Important things loom large in our minds, no matter how small they may be in the grand scheme of things; and when we shrink the scope of our concerns to put them in the frame, we often neglect the larger picture, whether intentionally or unintentionally. Most of us don't see the connection between pursuing this "divide and conquer" modularity (after all, we are taught this technique from the youngest age) and the impact it has on our ability to know things and reason about them. It is just a simple methodology that makes immediate and individual tasks manageable.

Yet this active avoidance of information must have some consequence: there is no escaping the intrinsic complexity of tasks. Modularization may backfire: the intent to simplify might unfold as an increase in the uncertainties surrounding the outcome.

Tim O'Reilly has discussed some of the many and varied symptoms of the changes to society, at the level of technological work culture, in his book WTF: What's the Future and Why It's Up to Us? These will be quite relevant to my story.

My focus in this essay will be to look first at the underlying mechanics of information, so that we may apply their lessons at all scales, starting from more low-level IT concerns. There are four key themes I want to discuss:

  • How aggregation of things into bulk scale ("redundancy") brings stability and defines our common idea about what it means to know something.
  • How clarity (of intent and "meaning") comes by searching for simple dominant (approximate) signals, eliminating details, i.e., from approximation rather than by chasing data.
  • How achieving the outcomes of our intent (keeping promises) is more precarious on the small scale, where one has no redundant stability.
  • How entropy, or the irreversible mixing of causes, which happens during aggregation, is both our friend and our nemesis when it comes to reasoning.

These are themes I have discussed many times before, e.g. in my book In Search of Certainty. What it amounts to is a conflict of interests between the desire for singular intent and its neutering by a noisy environment. 

2. Scaling causality by information

Information scaling influences how systems work, because all systems run on the exchange of information from input to output. There is observation or data collection ("monitoring"), then reducing data to patterns, storing, and perhaps outputting knowledge in a particular representation. Information also is an implicit dependency in the representation of concepts. In short, we need information to realize intentions, to characterize decisions, to distinguish specific outcomes. The way it is preserved traces what we mean by cause and effect. If the role of information changes, then the characteristics of a system may also change. A little elementary information theory helps to clarify this.

Claude Shannon was one of the first theoreticians to formalize ideas about information. Because of computers and marketing, we have become confused about the distinction between data and information. Phrases like "big data" suggest that we can never have enough data. But Shannon made it clear that having data is not the same as information, and "information" is not the same as "knowing". Data form a stream of observations which we classify into an alphabet of categories or "bands" (hence the term "bandwidth"); information is a collection of messages formed from these symbolic categories (e.g., text strings). It comprises patterns with partial ordering. An irreducible pattern is a symbol or digit of information. These digits are "states" of the system, or different states are different symbols — take your pick. Finally, knowledge is a repeated pattern in the information, a kind of model of the significant patterns in information. We acquire knowledge from the experience of interacting with information over time.

These are the basic ideas behind anomaly detection, from medical scanners to more advanced computer-alarm monitoring software. We first characterize normal by observation, thence anomalous states by deviance. Normal states occur regularly and abnormal states occur rarely, by definition. Thus, we study the differences between common and uncommon by trying to build statistical relationships to sources of data (through various sensors or channels of observation). Building these statistical distributions through repeated observation and reinforcement is a key part of cognition and learning.

The balance of information

Too much information can be as bad as too little information, because scale wipes out semantics (categories C). Shannon's intrinsic information has the form:

 

S = − ∑_{i=1}^{C} (n_i/N) log(n_i/N)

for samples of size n_i of type i that add up to a total of N. The entropy or intrinsic information is maximal when either:

 

n_i → 0
N → ∞
C → ∞

Shannon's information is also a measure of entropy, as von Neumann pointed out. Thus entropy (noise) is maximal, either when there is nothing at all to see (n_i → 0) or when there is so much to see that nothing really stands out (N → ∞). It is the relativistic equivalence of these conditions that can make too much and too little equally empty, under certain circumstances. Attempting to circumvent this basic truth by brute-force searching leads to false positives and a kind of data paranoia.
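
To make this concrete, here is a minimal sketch (hypothetical Python, with invented sample data) that evaluates Shannon's formula: one dominant category yields almost no intrinsic information, while spreading the same number of observations thinly over many categories drives the entropy up until nothing stands out.

import math
from collections import Counter

def shannon_entropy(samples):
    # S = - sum_i (n_i/N) log2(n_i/N), summed over the observed categories
    counts = Counter(samples)
    N = sum(counts.values())
    return -sum((n / N) * math.log2(n / N) for n in counts.values())

# One dominant signal: low entropy, a clear concept stands out.
print(shannon_entropy(["ok"] * 99 + ["error"]))             # about 0.08 bits

# Counts spread thinly over many categories: entropy (noise) grows,
# and no single pattern stands out any more.
print(shannon_entropy([f"event-{i}" for i in range(100)]))  # about 6.64 bits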

To learn new things, we need to find new patterns or categories. We can think of these as names "i", and we attach meaning to those distinguishable patterns. What is random and indistinguishable could be called noise. Noise is actually too much information, a signal that is changing all the time, which we don't bother to classify and compress. It is meaningless because we associate meaning with singular signals (and name these as "concepts" — e.g., true, false, red, green, blue, etc.). This is why monuments and logos are striking symbols that stand out against random backgrounds.

Now, consider the scaling of information and meaning, as we increase or decrease the size and amount of what we observe. What happens when we expand or reduce the level of access to information by, for example, restricting measurements to the scope of a walled garden, module, or container? The difference in access will affect the amount of data and therefore the statistical distributions. Will the distributions be smooth, jagged, unimodal, multi-modal, random? Will the chief characteristics follow the same patterns for a single module as for a non-modular system? This is an incredibly important question, which I have the strong suspicion is taken entirely for granted when technology and society get rescaled.

In other words, when we rescale something, do the relevant concepts need to change? If we stick stubbornly to concepts we know from another scale, we could be fooling ourselves. For example, if you grow up indoors, you don't experience weather. If you move to a larger space, you will. If you grow up in a single country, you don't experience different cultures or, conversely, if you are used to diversity, moving to a small town might disappoint you. The point goes beyond these mundane life cases though.

Do we also need to rescale our most cherished symbolic concepts, like true and false? We take these foundational concepts entirely for granted, thanks to the way we have absorbed them into mathematical and religious doctrine. However, their bases lie fundamentally in the relative sizes of sets or the scaling of evidence. Obviously, this is going to play a large role in how we reason about logical systems, especially as we start to scale systems in size or cut them into pieces for modularity.

What follows are some of the ingredients we need to think about in scaling of information, with examples.

Labelling of distinct categories of information at large and small scales: Take, for example, the distinction between true and false or between red, green, and blue. If labels can change when we scale, then we need to rethink our reasoning.

Entropy or loss of distinguishability due to i) mixing of categories, ii) loss of resolution, or iii) intentional aggregation: The popular narrative about entropy is that entropy is bad (mkay) — a decay or loss of order. But in fact we need just the right amount of entropy to make any kind of progress towards order and prediction. Too much order is not good for seeing connections. If everything is separate, there are no patterns. So the god's-eye view needs entropy. For example, if you can't distinguish red from green, then you might not be able to reason about traffic lights. If your detector's boundaries and sensitivity ranges change, then what you believe is red might be different from what someone else believes is red. The same, of course, is true for the categories of true and false.

The policy of accepting outcomes and conclusions based on weight of evidence or voting: This is the robustness of conclusions and outcomes from bulk indistinguishability, meaning a large number of votes for only a few distinguishable categories. If categories can change, then statistical inferences can change.

Norms, trends, or smooth robust patterns, and few anomalies are properties of large-scale aggregates: One does not see trends in a small system. If a collection of things is easily countable (in a human sense rather than in a mathematical sense), we treat all the things as independent. If they are not countable, we look for ways to ignore them, rescaling our expectations.

Differential causation: The calculus of smooth, deterministic variation is the calculus of large-scale aggregates, ignoring small-scale granular distinctions.

In the 1990s I enjoyed a little satisfaction at being the first (apparently) to point out and explain the causal patterns of resource usage behaviour in the distributions observed by computer monitoring. When we analysed the behaviour of computer services, in the age of dedicated servers, the busiest servers exhibited a beautiful pattern: a weekly image of the human working week reflected in the access data. Such patterns emerge (for example, on servers) over statistical timescales of many weeks, for large numbers of exterior traffic events, and show that the patterns of human behaviour totally dominate, telling us nothing about software efficiency. Because these patterns only emerge for large-scale traffic signals, they did not appear for personal terminals, laptops, or computers at low load. Today, this has changed. On a computer where nothing really happens, everything is an anomaly!

One consequence of isolating workloads in the cloud has been to interfere with these patterns, by removing the main source of statistical regularity for identifying resource anomalies. Increased separation of workloads by use of short-lived containers spun up on demand across distributed locations has the effect of artificially differentiating between similar things, actually reducing predictability.

Possibly if we had had the patience to wait for an order of magnitude or two longer (a human lifetime), some patterns might have aggregated into a different pattern for laptops. However, that will never happen for cloud workloads, because they are short-lived and move around too quickly.

Perils of cumulative inference: Imagining big fish in a small pond

When we reason about data, we reason about concepts and stable patterns that emerge, not about the individual data points. In other words, we reason about information and knowledge, not about data samples. We sift patterns from noise using cumulative pattern reinforcement, i.e., by learning (today often machine learning) over time. Like hypothesis testing, this works essentially by voting, only not for a single yes/no vote, but more like a political election with several categories, resulting in a distribution of votes over different cases. Incoming data vote for one pattern or another. It is easy to imagine how the scaling of categories alters votes.

Sometimes patterns vote for themselves, emerging unexpectedly and standing out, like the coherent swarms or flocks of the natural world. These are basically anomalous occurrences, but they resemble patterns we have seen in other contexts. Other times, we have a preconceived idea of what we are looking for: an anomalous glint of gold in a body of common silt. What makes either of these cases identifiable or interesting is their probable absence. But it is precisely the commonality of the ordinary that disappears when we operate at a small scale. If you stand on a beach, the sea looks quite uniform, but a single rock pool at your feet is alive with features.

The "damned lies" that statistics can tell are well known to many of us, so it should come as no surprise that how we defined scale is crucial to the fair representation of information. The need to traverse multiple scales, and build a multi-scale understanding of the world around us, is so important that we naturally take it for granted as humans. If we place boundaries arbitrarily and carelessly, we can totally change the knowledge we end up with. Consider the boundaries in the map below, and the counting of two kinds of event: squares and circles. This is basically like the voting for political seats in an election.

Region    Conclusion
1         Circles
2         Squares
3         Squares
4         Squares
5         Squares
6         Circles

In this example, it is easy to see that, without boundaries, the circles win the election. But if we first average all the different regions, as in the table, then take the vote based on the averages (like seats in a parliamentary government), then the squares win, because the population is quite different in the different regions. One ends up comparing apples and oranges to get a total reversal of logic.

This is a classic scaling distortion. Because we observe data locally, in closed containers, and trust the promises made by each region as equivalent, we get an "unfairly" weighted vote.
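
Here is a minimal numerical sketch of that reversal (hypothetical Python, with invented region populations rather than the ones in the figure): counting every individual observation gives one winner, while first collapsing each region into a single seat and then counting seats gives the other.

# Hypothetical (circle, square) counts per region; populations differ widely.
regions = [
    (900, 100),   # region 1: large population, circles dominate
    (40, 60),     # regions 2-5: small populations, squares edge ahead
    (45, 55),
    (48, 52),
    (47, 53),
    (300, 200),   # region 6: circles dominate again
]

circles = sum(c for c, s in regions)
squares = sum(s for c, s in regions)
print("Popular vote :", "circles" if circles > squares else "squares")  # circles win

circle_seats = sum(1 for c, s in regions if c > s)
square_seats = len(regions) - circle_seats
print("Seats        :", "circles" if circle_seats > square_seats else "squares")  # squares win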

Nyquist's law tells us that we are bound to get this wrong unless we sample directly at the scale of the smallest changes (in politics that is called proportional representation). Introduction of a hierarchy of boundaries, with upward aggregation, can completely alter the outcome. This shows how easy it is to pervert the intent to find a reasonable answer to a hypothesis. Any binary choice is basically unstable to rescaling.

Circle/Square
True/false
Normal/abnormal
Promise kept/not kept

Pointing out a much overlooked scientific fact: logic is not scale invariant!

Every time we rely on a promise, we are sampling the promiser/keeper of a data source. The promised outcome assessed by the promisee depends on the promiser's fundamental Nyquist frequency.

So the peril of going small may be to fundamentally localize ignorance of wider context. I believe that these shifts, within our technologies and our societies, to focus on small private concerns instead of publicly shared concerns will begin to erode our ability to predict outcomes that are relevant to everyone.

Trust and the scaling of true and false

Let's examine this more closely (or skip this section if you are bored). The stability that comes from redundant repetition and bulk approximation effectively grants us the licence to use these methods. If we try to apply them to small-scale, granular changes, they cease to be accurate, and may actually tell lies (e.g., phenomena such as Simpson's paradox may also come into play, because what is true of aggregate data may change qualitatively and quantitatively as we expand or shrink our horizons during scaling).

There is a hidden assumption here, namely that we can form an ensemble of systems that are equivalent and seem commonplace into a background. We expect to see a current of anomalous changes (fluctuations) against this background sea of similarity, acting as a reference for the concepts that are changing. If we see too much of the background, we can get lulled into thinking that is all there is. For a long time, human affairs have occupied a realm whose scale was relatively fixed and grounded in our evolutionary past. However, we are now exceeding those boundaries and confronting multi-scale operations. This brings with it surprises.

The sensitivity of any statistical conclusion is in some sense inversely proportional to the body of equivalent data. If we redefine limits and isolate systems so that every example represents a unique context then we sabotage that assumption. It's not about how big your data set is, but about how much entropy is contained within. If we open systems to every impulse from beyond, we may take on more data than we can perceive. So we need to find a balance. This is an inconvenient and troublesome conclusion. We prefer the imagined clarity of true and false.

Following the argument above, we see that empirical science is basically about voting. An experiment is a hypothesis election. If we can obtain data to support a position, i.e., if a sufficient number of observations vote for one hypothesis, then we tend to believe it is "true". Occasionally, there are several parties to vote for in the world-view elections, but if we reach a clear outcome, there is political stability amongst scientists often for years. This belief is shadowed by a long history of "deontism" or belief in divine laws (possibly emanating from the biblical story of Moses). Karl Popper later skewed opinions to place perhaps an unfair focus on falsification, as if the evidence that falsifies something would be more reliable or generalizable than the evidence that supports something. That this position gained so much support is also testament to the belief in the existence of a right answer and a wrong answer, rather than a progressive patchwork of approximate explanations.

Our trust in the unequivocal nature of truth is flawed. Beliefs scale. What is at stake is our whole system of belief in right and wrong, including the ability to make predictions based on a body of evidence (what we call the scientific method). Taking a vote is an entirely reasonable policy for resolving doubt (in the absence of some other incontrovertible evidence) but we should not imagine that it is some magical elixir of truth. Even our newfound enthusiasm for machine learning and big data are based on this view of truth and cognitive stability. But this is the very first time in history that most disciplines have had to confront a real change of scale, in which behavioural phase transitions are a real possibility.

Can we pass the true/false test? Do we trust? One of the illusions of scale we've come to trust (because we learned it in school!) is this simplistic binary logic, which is a very particular, highly constrained form of reasoning — but one that we believe implicitly in our day-to-day use of information. Trust turns into belief. This trust is misplaced, because logic is not an invariant property of systems as we scale them.

So, let's step back for a second and think about how we approach trust in the systems we build, as we start to scale them in ways we have never had to before. One of the key aspects of that is breaking systems into component services. Our argument about scale may gently crush your dreams about what we think we know about logic and reason and how it scales.

Smaller grains of truth: the ghost of Nyquist past

Logic is not a scale-free phenomenon. It is an idealized artifice of mathematics, which has an uneasy connection to science and phenomenology.

George Boole himself did not assert a dogmatic schism between true and false that many later attributed to Boolean logic. Indeed, he treated the continuum of certainty values between 0 and 1. We've grown to think of such numbers between 0 and 1 as being probabilities, but that idea is not as clear as we've been taught either. How should we decide the precise value between 0 and 1? The answer is that it starts with an assumption about the scale of observation, such as the sampling rate in Nyquist's law, and the tacit belief in the relationship between past, present, and future. Rather than being a basis for invariant truth, probabilities are properly understood as dimensionless scaling ratios, akin to what physicists would call renormalized variables.

Nyquist's law is a basic result in information measurement and representation theory. It says that, in order to measure a change on a timescale T, we have to be sampling the phenomenon at intervals shorter than T/2. The converse is also true: if we are sampling every T seconds, then we will generally be unaware of any changes that take place on timescales shorter than 2T. If a security camera sweeps every T seconds, thieves can sneak past if they are quick about it.
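
A minimal sketch of the security-camera point (hypothetical Python; the event times and the sampling interval are invented): an event that begins and ends between two samples never appears in the sampled record at all.

def signal(t):
    # A burst that starts at t = 5.2 s and is over by t = 5.8 s (duration 0.6 s).
    return 1 if 5.2 <= t <= 5.8 else 0

T = 1.0                                        # sampling interval in seconds
samples = [signal(k * T) for k in range(11)]   # sample at t = 0, 1, ..., 10

print(samples)       # all zeros: the thief slipped between the sweeps
print(any(samples))  # False -- absence of evidence, not evidence of absence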

If a cloud application only runs on demand for a few minutes or even seconds, it may never explore regions of its phase space where problematic behaviour occurs, so we can never train an algorithm to detect it.

We all know the aphorism that absence of evidence is not evidence of absence. Nyquist's law formalizes this. It is one thing to say that true and false (black and white) are intentional complements, but in the figure below, we see that the difference between the observation of black and white is not as clear cut as we might think. This is simply because knowledge is never based on a single point datum; each observational concept is an ongoing cognitive process. 

Even the apparently simple assumption that true and false are opposites (complements) need not be true.

To see this even more obviously, look at the circular sampling regions in the figure below, and consider how the sample resolution (represented by circles, which may extend in space or time) can lead to different conclusions. If we look on a small scale, we might think the system is white. A larger circle might consist of multiple points, some black, some white, some yellow, blue, etc. If we sample too small, we may not see all the possible values a measurement can capture, and we might be misled. If we look on a large scale, there might be multiple values within a single sample. How do we decide what value a sample represents? What is the representative value? What is the average of red and black? Is the sample black, white, red, green, or blue? How shall we represent mixtures? 

Is the answer black or white? Or yellow? Or all of the above? The question for any scientific observation is: how can we know the correct size to sample?

Some promise theory... on truth and data science

One could argue (indeed, one did after reading my initial draft) that truth doesn't change as we rescale. However, our assessment of it does.

Suppose I say that men have two legs. After collecting a huge amount of data, I find there have been men born with three legs. It's not that "men have two legs" is true on a small scale but false on a large scale. It never was absolutely true. It was very nearly always true, and remains so after seeing more data.



This comment illustrates that our notion of what is true is a prior construct (a template, or what information theory calls an "alphabet symbol") rather than an empirical fact. It is a position with a similar status to a hypothesis. If we think in terms of promise theory, we look at all the humans in the world as agents that promise a certain number of legs (a + promise). The observer has to make a corresponding promise to accept (-) or reject the promised data from each of the members of its sample set. Moreover, the observer is free to make its own assessment of the state of the world. It could reject humans who don't have two legs as being outliers, or it could re-characterize the nature of what it accepts as truth. However we look at it, our assessment of truth must change quantitatively and perhaps qualitatively as we rescale the superagent boundaries of our data sample.
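
A toy promise-theory sketch of the two-legs example (hypothetical Python, invented numbers): each agent makes a + promise about its legs, and the observer's (-) assessment of "humans have two legs" is just the fraction of kept promises within whatever superagent boundary it happens to sample.

import random
random.seed(1)

# Each agent promises (+) a number of legs; two is overwhelmingly common.
population = [2] * 99_999 + [3]
random.shuffle(population)

def assess(sample):
    # Observer's (-) assessment: fraction of the sample keeping the "two legs" promise.
    return sum(1 for legs in sample if legs == 2) / len(sample)

print(assess(population[:10]))   # a small superagent boundary: almost certainly 1.0
print(assess(population))        # the whole population: 0.99999 -- nearly, never absolutely, true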

In aggregation, how do we draw the line between a composition of distinguishable things (a new thing), and a larger ensemble of indistinguishable things (a bigger thing)?

This simple point of how monitoring can let us down should send shock waves through the halls of companies that do computer monitoring and data science (and their complacent customers), which do basically nothing to address these scaling issues (as I commented in "Artificial reasoning about leaky pipes: How monitoring lets us down by shrugging off non-trivial causation").

The cone of shame: Why unit testing promises nothing about causality

How does this affect our ability to reason? Causal reasoning (e.g., for prediction and diagnosis) is based on the assumption that what we know about the present (and possibly the future) may be influenced by or even be a direct continuation of what happened in the past. More generally, we mean that what happened "there and then" is assumed to be a reliable guide to what happens "here and now" and possibly "elsewhere and elsewhen".

The events that could influence the present form a space-time cone (called the light cone in physics). It represents all the signals that could have travelled at up to the speed of light to reach the present. This is the theoretical best case (ignoring quantum mechanics) for what might influence the here and now, for causality.

But, in practice, we are not able to observe everything that might have influenced this nexus of causation. If some agent is not capable of receiving signals from some area or capable of distinguishing them, then the amount of information is reduced. An agent that is confined to a box, for instance, can't see very far into the past cone at all. The more we attempt to isolate an agent, the less we observe of the past cone and its ability to causally influence it, reducing the bulk of prior evidence and making prediction harder.

Notice that logical isolation does not protect the agent from those past influences. Only complete physical isolation can do that. So the notional modularity of software engineering does not isolate cause and effect, it only puts our blinkers on about how it might arise.

Every time we limit context (the foundation of causal behaviour), we reduce the amount of evidence on which we could draw in order to make inferences. The causal evidence that underpins all decisions and outcomes is much more extensive for the interacting whole than for the component in isolation.

In software development, unit testing is the deliberate strategy of limiting causal influence on a piece of computational machinery in order to filter out unwanted causal effects. First, we test component by component, then we test their integrated assemblage. Unit testing is a facsimile of factory component testing, used by manufacturers of components to be sold as a separate commodity. If a component keeps a clear promise then it is the responsibility of its buyer or user to use it correctly. The manufacturer has separated its concerns and pushed responsibility for its behaviour in context onto someone else.

Deliberate separation of testing into units and integral usage is peculiar in software. There is certainly no point in performing unit tests on things that are not reusable, and such tests might give a false sense of security.

According to promise theory, a conditional promise is not a real promise, because it depends on its condition. So a unit test is, by its own admission, only a partial, conditional promise of functionality, one that promises nothing about the inputs the component will actually receive. It becomes a real promise only if we also promise what values it will be fed, like promising that sunglasses will work in the dark.
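
A minimal illustration of such a conditional promise (hypothetical Python; the function and test are invented): the unit test passes for the inputs the tester imagined, but an integration caller that feeds the component something else finds that no unconditional promise was ever made.

def average_response_time(samples):
    # Promises the mean of its inputs -- conditional on samples being a non-empty list of numbers.
    return sum(samples) / len(samples)

def test_average_response_time():
    # The unit test only votes with the inputs we chose to imagine.
    assert average_response_time([2.0, 4.0]) == 3.0

test_average_response_time()   # passes: the conditional promise was kept

# In integration, an upstream service returns no samples during a quiet period...
average_response_time([])      # raises ZeroDivisionError: this input was never part of the promise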

I suspect that many developers do this from a basically misplaced trust in the separability (effectively the linearity) of cause and effect in a system. System behaviours are not sums of parts, but collective excitations on top of a network of promises.

Cognitive observation, or why machine learning will not help

Some believe that machine learning is the answer to every problem. We throw big data at a neural net and it will simply figure out everything. It should be clear now why this is wrong. Without knowledge of the scales for sampling and Nyquist's law, we can't know how much data we might need to resolve the patterns and phenomena we might be looking for. We don't initially know the entropy of our system. Whatever successes deep learning may boast today, it is all down to successes of handcrafted training. Conclusions are contextually anchored by the interaction of at least two different feedback timescales: the timescale of introspection (or recalling existing knowledge, gathered over long times) and one for "extrospection" (sensory input, being gathered in real time).

Statisticians have made a big song and dance about the non-existence of causation, because in their field causation is all but impossible to reverse-engineer from data. But their own conclusion fails the scaling test. What might be true in the limited field of statistical inference need not be true of the wider world. The problem in pattern recognition (causal inference) is the inference, not the causation. The AI version of the aphorism above might be phrased: the intent to find evidence is not evidence of the intent (i.e., of the recognition of a pre-existing pattern).

The reason unsupervised learning remains challenging in AI is that it is a multi-scale evolutionary selection phenomenon. It takes time on a scale much greater than the useful life of a single cognitive process. No one is willing to send a generic AI backpacking around on a personal journey of discovery to see what it might come up with. This is one reason why AI will not pose the kind of threat that scaremongers warn will come about as a result of creative genius.

The adoration of calculus

Finally, let's explicitly call out two of the most cherished doctrines of the enlightenment, differential calculus and statistical methods, as worthy of much greater scrutiny, not just our blind trust. If separation of concerns is IT doctrine, then calculus and probability are the unassailable doctrines of modern science. Like other foundations for knowledge, these have become infrastructure. The use of these methods rests on assumptions that can be invalidated by changes of scale. As we isolate smaller pieces of systems in an attempt to understand causation and avoid unwanted influences, we are confronted with a paradox: our ability to predict behaviours based on prior evidence gets progressively worse the more we isolate a system, because our amount of bulk evidence is reduced. I've already made this clear for statistical probability theory.

In the case of calculus, the assumption is one of idealized micro-changes: the extrapolation of small to zero. In centuries past, Newton and Leibniz independently developed this ingenious technique, effectively compressing information about infinite summations of tiny changes into extremely simple (but approximate) patterns, with virtuous analytical properties (think of our cherished functions sine, cosine, logarithm, etc.). Such was the beauty and perfection of these forms that many came to believe in their truth as actual objects of nature — just as astronomers once tried to fit the orbits of the planets into regular polyhedra, hoping for some presence of a divine message to confirm their eternal souls.

The sheer convenience of working in closed forms, with real numbers, which could be computed to arbitrary but misleading accuracy, led to the unfortunate phrase "exact sciences". The exactness, however, is only within the approximation assumed. The approximation they used was the differential calculus.

The idea was that if we could imagine infinite resolution and infinite sampling of data, we could smooth out all the bursts of data into a simple compressible representation that we call a "smooth function". Fourier summations and other transforms fill in the details of how this happens piecewise. The spiritual seduction of this idea is hard to overstate. Until the 20th century and Boltzmann's brilliant insights into the fundamentally non-deterministic discreteness of things, differential calculus held science utterly in its thrall. Indeed, in some areas of science, it still does. Because the method takes the limit of infinite resolution

Δ t → 0,

it cunningly conceals the key question: over what scale is a derivative with respect to time valid? In other words, from what point of view, what negligence of probing, can we say that the changes in a process are smooth and instantaneous? Data are fundamentally discrete collections of discrete events. One cannot apply calculus to real data, only to approximations of a lot of data, where rescaled averages make sense: idealized data, with the assumption of local continuity. When we apply differential methods, we assume our data are in a state of smooth continuity or local equilibrium at every point, at some infinitely small scale, below the limits of resolution. The problem with this viewpoint is that it is at odds with the theory of errors and uncertainty (or information theory, if you prefer), which tells us that there is a finite granularity to data based on the limitations of our process of observation. This is one reason why economists have made basic errors of theory in applying calculus to economic scenarios without a clear model of scales.
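
A minimal sketch of the problem (hypothetical Python, with an invented smooth process and noise level): a finite-difference "derivative" of noisy, discrete measurements does not converge as Δt shrinks; the bias falls, but the noise-driven scatter grows roughly like σ/Δt.

import math, random
random.seed(0)

sigma = 0.01   # measurement noise on each discrete observation

def measure(t):
    # A smooth underlying process seen only through noisy, discrete measurements.
    return math.sin(t) + random.gauss(0, sigma)

def finite_difference(t, dt):
    return (measure(t + dt) - measure(t)) / dt

for dt in (1.0, 0.1, 0.01, 0.001):
    estimates = [finite_difference(1.0, dt) for _ in range(1000)]
    mean = sum(estimates) / len(estimates)
    spread = math.sqrt(sum((x - mean) ** 2 for x in estimates) / len(estimates))
    # The true derivative is cos(1) ~ 0.54: shrinking dt removes the bias
    # but amplifies the noise, so the estimates scatter ever more wildly.
    print(f"dt={dt:7.3f}  mean={mean:7.3f}  spread={spread:8.3f}")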

In some cases, the differential calculus we have come to cherish in classical physics, and later beyond, is a calculus of hubris. It is a calculus of trust in the eternal stability of an underlying picture whose parts are assumed to be in local equilibrium. It's a product of the religious convictions of the society that begat it. This is often true in physics, where the mind-boggling range of scales justifies the approximation in most cases (hence the huge success of the calculus in modern physics). But it also applies to circumstances where there is no obvious recourse to a continuum approximation: no law of large numbers to appeal to. Economists and sociologists revere these totems of power from the elementary sciences, apparently unaware of their limited realm of validity.

It's remarkable how rarely we confront these assumptions. How we reconcile calculus with probability, for instance, involves a ream of assumptions (see the discussion in my book In Search of Certainty). We often treat variable probabilities as continuous variables when they are really discrete or rational fractions, in a form of analytic continuation. Bayesian interpretation fixes several interpretational subtleties with the frequentist interpretation of probability.

There are many real-world examples of the rough emerging from the smooth:

  • Mountaineers know that snow starts out smooth, firm, and predictable, until it recrystallizes from below into larger crystals with weakened mechanical stability. This leads to avalanches.
  • Society is predictable and stable when united behind a singular leadership and culture. When it fragments into special interest groups and differing political ideals, the cohesion is lost.
  • The weather is predictable and stable when large uniform areas of stable high-pressure air dominate the patterns. When these fragment into smaller cells of storm systems, the predictability collapses into chaos.

The differential calculus cannot be the right language for a world in which the semantic units and behavioural cells are discrete and getting smaller — in other words, where smaller and smaller distinctions matter to us.

3. Implications of scaling: From microservices to macroeconomics

Finally, let's look at how these points apply to some key examples. How has our ability to influence and reason about our systems been influenced by scaling?

In the beginning, I posed a few questions. We can reframe those now:

  • Why are small businesses less predictable than large businesses?
  • Why are short-term changes less predictable than long trends?
  • Why are small software installations harder to optimize efficiently than large ones? Why are there economies of large scale but not small scale?
  • When is a behavioural anomaly significant? When it is rare amongst a large population?
  • When is it appropriate to gather big data, and what could it tell us about smaller samples?
  • What does the result of the national election (as in the voting map earlier in this essay) tell us about public opinion, about the allegiances of a county or an individual?
  • What does a CPU anomaly measured on a physical computer tell us about the behaviour of a single process on that machine?

Questions like these are ubiquitous in work, in management, and in science. We can add a few specific notes to the different areas of applicability.

 

Scaling of concepts from small (above) to large (below).

Information infrastructure at scale

Information technology has evolved from the use of individual computers to massive managed server farms (from individual hands on to aggregate hands off) back to individual responsibility for workloads operating on top of server farms (something like the shift from homestead farming to nationalized agriculture and leasing land to privately rented growers, organized by product).

  • Breaking up hardware and software into smaller pieces adds overhead per instance and so is less efficient in terms of individual costs but can be more efficient in shared costs. It is the way we enable innovation and diversity. We can adapt issues, which we can still treat as common to all workloads, to environmental patterns on a scale larger than the individual workload (e.g., typical user load, weather, temperature, prices, etc.) and optimize for cost and predictability. To the small parts, large-scale patterns just look like random anomalies.
  • Spot prices for cloud rentals follow large-scale patterns but appear random to small-scale users because the information concerning the patterns of availability is not available at the scale of an individual. Individual workloads may still perceive the rental as cheap, but the cloud landowner makes significant profit because it does not pass on all its efficiencies of scale to its tenants.
  • Sharing of specialized tools is less likely to match relevance criteria across workloads, so the total costs of tooling rise with specialization. Without cross-cutting/sharing services, knowledge of local circumstances and global "weather" (market prices etc.) will not be observable to each private farm.
  • Individual customized web services and APIs all have generic (high-entropy) IP transport as their basic dependency. All web service routing is dependent on the routing patterns of the IP layer, which is optimized without reference to payload (i.e., not by application, i.e., net neutrality). Individual applications cannot distinguish the effects of those patterns so latencies and failures appear random, not causal, to the application tenants.

The lesson is that small things can benefit from the knowledge of collective experiences of similar small things, but only if data are shared between similar things. For distinguishable workloads, this doesn't happen. Dissimilar things will only muddle conclusions, so specialization is the enemy of experience. The economic viewpoint will only be available to aggregators like cloud providers, brokers, and utility providers.

Economies of money at scale

In economics, we break up flows of money into separate categories or "budgets". Sometimes, we even divide these further into currencies (currencies and budgets are somewhat similar things).

  • The budget for an individual person is unlike that of an aggregate, a company, a household, or a city or country. Budgets for aggregate organizations may cover many concerns, some of which overlap. From the perspective of a single budget, a larger economy appears random and capricious, because the causal reasons for change are not observable. Budgets are therefore not scale invariant.
  • If budgets can be mixed, i.e., earmarking of funds can be ignored, then a surplus in one post can make up for a shortfall in another. Cash flow is thus stabilized by entropy and destabilized by specialized earmarking. This is why merger and acquisition may be attractive for small companies.
  • Globalization has deconstructed some of the boundaries between geographical regions, allowing surpluses and shortfalls to equalize across national boundaries. Politics, laws, and treaties define boundaries that may be geographical, corporate, or based on common interests. Flow economics applies to services, utilities, and to money (which is basically a network delivery service).
  • The classical way to compensate for exposure to uncertainty is by redundancy or hedging, i.e., by building an economy of scale as a large entropy pool from which one can borrow. The amount of entropy in the redundant parts will determine whether there are patterns we can adapt to or optimize for.
  • This method of stabilization is exactly as in thermodynamics where work/temperature stability is managed by arranging for a global heat reservoir that absorbs local fluctuations. Central banks play the role of a stabilizing reservoir for national finances.
  • There is a well known theorem in reliability theory and queuing theory that says that one can always achieve better cost/throughput efficiency by mixing separate queues into a single high-entropy mixture, with multiple indistinguishable servers, rather than by trying to separate multiple queues. The cost lies in making the servers capable of appearing indistinguishable with the same competence. Making queues or workloads indistinguishable allows us to drop labels and handle the work with maximum availability. (A small numerical sketch of this comparison follows this list.)
  • In cash-flow accounting, we partition time typically into monthly intervals. If one agent can't pay its debts before the end-of-month deadline, it can borrow from another (free of charge, without interest) simply by maintaining a common pool of resources.
  • The role of government involvement and monopoly in organizations is much discussed in politics and economics. The question is: is it more efficient to scale some activity as a single vertically integrated firm or organization, or should it be broken up into a network of cooperating entities? Promise theory tells us that there is no real difference between the two, provided i) access to the information is the same in both cases — i.e., secrecy, proprietariness, and lack of cooperation do not disadvantage the operation of certain parts of the organization — and ii) if there is a single all-powerful boss strongly coordinating instead of a market responding with weak coupling. A market is likely to respond more slowly than a boss or management channel (unless management is truly hopeless or the market is a high-speed electronic platform) because a market requires its own distributed equilibration, but a single coordinator might not have the same capacity to adapt to the many component specializations.

    How we set the timescale of that deadline plays a crucial role in system stability. If the payment deadline extends to infinity, there is no need to pay anything back. If it shrinks to zero, many payments could not be made, and companies would not be able to do business. Coarse-grained approximation is therefore a significant feature of economics. Instantaneous response is not a useful approximation.
  • The study of scaling in the economic indicators for companies, cities, and communities was pioneered by Geoffrey West (see his book Scale, at the end), Luis Bettencourt, and others. They observed that growth in output with size of population could be attributed directly to the implicit ability of infrastructure to absorb the scaling changes, connecting economic entities without limit. In the case of companies, the scaling is different: it is linear rather than superlinear for cities. This doesn't quite make sense unless something special happens between scaling a company and scaling a city (some companies are the size of cities). My small contribution to the promise theory of this scaling suggests that the independent scaling of subsidiaries and service-component dependencies plays a role in the superlinearity of city scaling, which in turn suggests that free markets might grow faster than monopolies, whether their sum costs are more or less efficient on paper.
  • West also pointed out that scaling of the economics in urban society is not just exponential, but super-exponential (for a summary, see Scale). As infrastructure becomes saturated, cities have to rely on innovation to increase capacity "just in time", else one might see a catastrophic collapse. The infrastructure acts as a stabilizing reservoir. Each cycle of rescue from the brink of collapse by innovation is playing out faster. It is a self-imposed arms race with accelerating pace, one that we cannot hope to win sustainably forever. To find acceptable answers for human society, Tim O'Reilly has argued that we need to ensure that our moral compass triumphs over the choices we make in this economic arms race.
  • When economists discuss markets, they make assumptions equivalent to indistinguishability of commodities and firms, or high market entropy, which they believe allows them to speak of equilibria. If the economy ever resembled such an ideal gas, it was at a time of large commodities like oil, steel, wheat, etc. In the specialization economy, it is going in the opposite direction, from few large reservoirs to many small ones, making equilibrium of consumption fragile. Moreover, the complex webs of dependencies in the modern component-manufacturing age of globalization render the closure of networks, which would be needed for equilibrium, suspect.
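
Here is the promised sketch of the queueing comparison (hypothetical Python using the standard Erlang C formula, with invented arrival and service rates): pooling two separate M/M/1 queues into one M/M/2 queue with indistinguishable servers cuts the mean waiting time by more than half.

from math import factorial

def mean_wait_mmc(lam, mu, c):
    # Mean waiting time in an M/M/c queue (Erlang C): arrival rate lam, service rate mu per server.
    a = lam / mu   # offered load in erlangs; requires a < c for stability
    tail = (a ** c / factorial(c)) * (c / (c - a))
    p_wait = tail / (sum(a ** k / factorial(k) for k in range(c)) + tail)
    return p_wait / (c * mu - lam)

lam, mu = 0.8, 1.0                           # hypothetical rates for each separate queue
print(mean_wait_mmc(lam, mu, 1))             # two separate queues: ~4.0 time units each
print(mean_wait_mmc(2 * lam, mu, 2))         # one pooled queue, two servers: ~1.78 time units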

The lesson here is that aggregate buffers or pools of resources are essential to be able to handle unexpected peaks of necessity. In economics, we need to manage liquidity to avoid catastrophes of our own making. We could redefine away most catastrophes (at least those of our own making) if we understood scaling. You cannot predict the liquidity or cash flow of a company from the economy at large, because it is an aggregation over many dissimilar things, over timescales that are much longer than monthly cash-flow cycles. So economic forecasts are basically nonsense for most members of society.

In promise theory's scaling language, we collect clusters of individual agents together into superagents. At each scale, there are new promises and hence new responsibilities in the network. Any simple pool of agents that are unified by the need to keep similar promises can form a pool, buffer, or reservoir of resources in order to borrow from the shared pool and ride out the spurious plusses and minuses of living.

Society at scale

Society has to change in response to evolving demographics as technology succeeds in extending our lives. We must absorb more of the functions of society into an automated infrastructure. Tim O'Reilly has given probably the best overview of this subject in his book (see end note).

  • Employment becomes a service (utility) rather than property. With information technology to coordinate, we have companies like Uber and AirBnB acting as entropy generators: schedulers and allocators for shared resources of society. In the past, specialized workers have been attached to a proprietary container, like a single company. Now the Uberization of the workforce or the rental of workforce as a utility, without single ownership, transforms the workforce into a high-entropy reservoir for maximum availability, if the jobs only have high-entropy requirements (distinguishing one worker from the next is not important).

    Whether or not we choose to lock workers into individual boxes through ownership by companies is a choice. O'Reilly has argued that how we make those choices needs a new level of moral guidance that appreciates scaling for all of humankind (not just shareholders). We have to ask if the work is for the benefit of all or whether competition must be hostile.
  • Through interest rates, banks herd investment and manipulate consumption cycles. Most banks are independent entities: they provide access to money (credit or savings) as a utility, but they weather feedback cycles that are unstable without regulation. Central banks attempt to engineer behaviour by reverse causation, by promising interest rates for private-bank lending, in the hope that private banks will adopt them and thereby nudge borrowers to borrow more or less.

    The expectation of an immediate causal response in macroeconomic policy ignores the scale differences between individuals, small companies, and large companies. It stretches common sense to assume that when interest rates fall, businesses will rush out to borrow money, or that when rates rise, people will rush to repay loans, in an immediate and deterministic fashion. At best, there must be hysteresis, a delayed response (see the sketch after this list), but this is not accounted for in the local differential models of macroeconomics.
     
    Δt_response ≫ Δt_interest


    Large-scale models, such as macroeconomic models, work by aggregating categories or dismissing distinctions in order to reduce the number of variables and to argue for the viability of a differential model. But, having removed details on the scale of individual humans, businesses, or business cycles, we have to ask: i) on what timescale might the results be relevant, and ii) for whom (for what aggregate, average human circumstance) are the results representative?
  • Multinational corporations cannot easily take into account the semantic interests of local communities (especially those in underdeveloped countries, whose cultures are far from industrialization), because they operate on too large a scale. Applying a blunt instrument to a finely grained cultural diversity risks wiping it out. This is a familiar story in the modern world.
  • We see a general degradation of privacy for individuals (thanks to smartphones and online services), while there is less of a corresponding increase in the transparency of corporations or governments, where it is easier to build barriers. As superagents are formed by aggregation, each new scale makes new promises, and therefore carries new responsibilities.
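To make the hysteresis argument concrete, here is a minimal sketch (all numbers, names, and the linear demand curve are invented for illustration, not a macroeconomic model): aggregate borrowing is assumed to relax towards a rate-dependent target with a long time constant, so when Δt_response ≫ Δt_interest the observed borrowing never catches up with the policy rate.

```python
# Toy first-order lag: borrowing relaxes towards a rate-dependent target
# with time constant TAU, while the policy rate changes every STEP months.
TAU = 36.0        # assumed response time of borrowers, in months
STEP = 6          # assumed months between central-bank rate changes
RATES = [5.0, 4.0, 3.0, 4.0, 5.0]   # an arbitrary policy-rate path, in percent

def target_borrowing(rate: float) -> float:
    """Assumed long-run demand for credit at a given interest rate."""
    return 100.0 - 10.0 * rate

borrowing = target_borrowing(RATES[0])
for rate in RATES:
    for _ in range(STEP):
        # Only a fraction 1/TAU of the gap closes each month.
        borrowing += (target_borrowing(rate) - borrowing) / TAU
    print(f"rate {rate:.1f}% -> borrowing {borrowing:6.2f} "
          f"(long-run target {target_borrowing(rate):6.2f})")
```

Because the response time dwarfs the interval between rate changes, the instantaneous correlation between rate and borrowing is weak, even though the long-run causal link is built into the model.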

Politics at scale

Geographical boundaries, as political regions, are a major semantic limiter on systemic predictability.

  • The appetite for merger and acquisition of nation states, despite the potential economies of scale, has waned in the modern era. The potential benefits of scaling for a country are the same as those for organizations and firms, yet recently we have seen a rise of separatism in the UK, Spain, Hong Kong, and China, and even the retreat of the US from the global stage. National separatism is on the rise, indicating precisely the triumph of special political interests over general economic efficiency.
  • The ability of governments to tax citizens, as a mandate to work for the common good, is compromised by the separation of national finances into political regions (as witnessed by the European Union's tax claims on Apple, Facebook, Google, and others). Company boundaries need not respect national boundaries or currencies. This is indicative of what is likely to come: it is currently prohibitively difficult to gather tax revenues from multinational companies, and from the explosion of currencies (including loyalty discounts, flight miles, etc.). A way to restore the entropy of electronic monetary currencies, in a politically expedient manner, is needed to preserve the current system of government.
  • Nation states have the sovereign right to write off debts within their borders, but no country can do this internationally without the permission of a multinational body (witness the case of Greece over the past decade). Lenders like the IMF and World Bank could do this, but lack the political mandate. The inability to write off debt between nations hinders the equilibration of inequality and sustainability.
  • The inability of many European countries to form coalition governments, due to a historical proliferation of nuanced political parties with incompatible semantics, is a result of attempting to retain a broader sampling alphabet, rather than breaking the country across a knife edge of black/white, true/false, left/right. Oversampling of public opinion then has to be resolved by negotiation, which has a questionable link to the reasoning behind the initial vote.

4. Endnote: The economics of relative things

Scale is a simple but subtle master. As I wrote in my book In Search of Certainty, we can learn from dynamical similarities, provided we are not slaves to too much individual detail. Entropy and interchangeability can be our friends. Otherwise, we have to apply brute force to manage an information-rich world of diverse categories in the face of indeterminism.

We currently believe that going small means better control, but this is not guaranteed. Going small also means less certainty, and less stability.

Our desire to pursue autonomy and personal freedoms drives us to a world of small things, on top of a world of growing infrastructure (these two worlds are separating, like H.G. Wells's Morlocks and Eloi), but we should not fall for the mirage of isolation. As we shrink our world views around the viewpoints of isolated subjective agents, our belief in predictability will start to look as random and as spurious as quantum fluctuations. We need to be ready to take on this reality. We have to stop trying to live by futile questions, like these parodies of typical management issues:

  • Is the weather right or wrong for my holiday?
  • Is cloud computing good or bad for my application?
  • Is the economy good or bad for my business?
  • Is culture good or bad for my taste?
  • Is demand killing my product?
  • Is society harming or helping you?
  • Is aggregation good or bad for the individual?

Promise theory makes it clear: we are both givers and receivers of causation. If you are still asking questions like these, you might need to rescale your definition of good and bad.

Science tells us that our well-worn definition of "truth by statistical voting" cannot necessarily work the way we want it to, if we choose to pursue detailed semantics, because semantics change as we rescale. True and false are luxuries of slow-moving, bulk systems. The scaling laws do not apply to small numbers. The bottom line is this: if we don't want to have some unpleasant surprises, and if we want to claim impartiality in a scientific sense, then we'd better adapt to the economics of small things, atop a generic infrastructure that absorbs the shocks and keeps it all together.
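As a final toy example of the "small numbers" point (my own sketch; the 55/45 split and the sample sizes are arbitrary): the same underlying preference that looks like a solid "truth" in a bulk sample flips its majority verdict alarmingly often in a small one.

```python
import random

def wrong_majority_rate(voters: int, p_true: float = 0.55, trials: int = 20000) -> float:
    """Fraction of votes in which a population leaning 55/45 towards 'true'
    nevertheless fails to return a majority for 'true'."""
    wrong = 0
    for _ in range(trials):
        yes = sum(random.random() < p_true for _ in range(voters))
        if yes * 2 <= voters:          # ties count as failing to find 'true'
            wrong += 1
    return wrong / trials

for n in (5, 25, 101, 1001):
    print(f"{n:5d} voters: majority contradicts the bulk preference "
          f"{100 * wrong_majority_rate(n):.1f}% of the time")
```

At small scales, the "truth" returned by a vote is itself a fluctuating quantity; only in slow-moving, bulk systems does it settle down to something we can treat as definite.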

Wed Jan 10 13:51:52 CET 2018

Acknowledgement

I'm grateful to John D. Cooke for a critical reading, and many helpful comments.

Further reading

In this essay, I refer to three books:

  • Geoffrey West, Scale
  • Tim O'Reilly, WTF? What's the Future and Why It's Up to Us
  • Mark Burgess, In Search of Certainty

About the Author

Mark Burgess is a theoretician and practitioner in the area of information systems, whose work has focused largely on distributed information infrastructure. He is known particularly for his work on Configuration Management and Promise Theory. He was the principal founder of CFEngine and ChiTek-i, and is emeritus professor of Network and System Administration at Oslo University College. He is the author of numerous books, articles, and papers on topics ranging from physics and Network and System Administration to fiction. He also writes a blog on science and IT-industry concerns. Today, he works as an advisor on science and technology matters all over the world.
