
Observability in the SSC: Seeing Into Your Build System



Ben Hartshorne describes the transformation that Honeycomb went through, when they dropped build times by 40% and gave themselves the ability to track build times and asset sizes over time. Hartshorne covers the techniques one can use to accomplish the same goals in different environments.


Ben Hartshorne is an engineer at Honeycomb. For the last 13 years, he has built monitoring, alerting, and observability systems for companies ranging from startups like Simply Hired and Parse to large organizations such as Wikimedia and Facebook. He enjoys this work and is happy to finally be building a company and product that will help tease out nuances in data in novel and powerful ways.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Hartshorne: My name is Ben Hartshorne. I have spent most of my career running production software: operations back when ops was its own encapsulated thing, extending down into the network, and then, as our industry progressed, up into the software stack to become DevOps. Then at Honeycomb, I started working on the actual applications themselves. We didn't really have ops, so I was just dev. This puts me in a nice place to think about the whole system, the software supply chain and where we're going. I work for Honeycomb. We are an observability company. We make tooling to help other companies understand their complex production infrastructure.

I'm going to tell a story. It is a story about a problem we had at Honeycomb. It's not in any way unique to Honeycomb. In fact, I will bet that amongst this audience, 9 out of 10 of you have had this problem at some point in your career. We are in a unique spot to be able to bring some new insight into this problem. I'll walk through how we did it and give you some code samples and talk through the actual implementation of the problem and our solution. Then the last part of this talk is up to you because build systems are just one small part and there's a lot of work to be done.

Story Time

Our story: we're building software, we're starting a company, the code base grows and our builds get slow. We thought they were slow because anytime you're sitting there watching a build go by it always feels slow because you're waiting. We realized we actually didn't know, was it just this one build that I was watching that was slow? Are they slow overall? We looked at our CI system and we got this nice long log of all of the things that we're doing. We can stare at the timestamps a bit, but that doesn't really give us a good feeling. Is this subjective? Is it objectively slow? We had no idea.

At Honeycomb, we are building tools to help companies understand complex infrastructure. We believe in this idea of observability that through instrumentation and through getting telemetry out of the systems that you run, you can run them better, you can make better decisions, and improve your lives overall. We did that and we got some pretty traces of our build. We decided to use the model of distributed tracing to look at what happens in a build system.

I'm going to run through what this graph represents, for those of you who haven't looked at a waterfall diagram from a distributed tracing system before. On the left, we are looking at what happens. On the right is how long each of those steps took. The entire build is represented at the top. For each step, the length of the bar indicates how long it took, and the location of the bar indicates when the job started. This is a visual record of the build system going through each of the commands that it needs to run, showing how long each step took.

We stared at this for a minute and we thought our build was slow. Looking at it, what are we supposed to do here? How can we make it not slow? The go tests took under two minutes to run. That's not bad. The other longest pole there is 88 seconds. See, not too bad. The total run, 398 seconds, just over 6.5 minutes. That's actually ok. We decided to do nothing, which is actually a wonderful thing to be able to do. When you are confronted with a problem, you can look at some data and you can decide, "No, I'm not going to do anything. Going to go back to solving other more business-relevant problems." That's great.

We didn't do anything, we let it sit. The instrumentation was there. A year later, when we thought our builds are feeling slow, we could look at that progression over the course of a year. Yes, they went from roughly 6 minutes to up on 14. It's getting into the territory where it starts to be useful to think about doing something about it and we had this instrumentation. We looked at the same view and noticed, "Yes, the bars are longer. Our build is slower. It takes more time." There still wasn't a really clear indication of what we should do to fix it.

The other bit we couldn't really say for sure: this is a representation of a single build. Is it representative of all of our builds? It was the same question we had before. With the instrumentation around each one of these steps, we can look at them individually. This is the same one-year timescale we were looking at before, where each line is the 95th percentile of build time for that particular job. This is really different from anything we expected. We thought our build was fast earlier and then it got slow. Surely, all these lines are just going to slowly go up and to the right. They don't. Some of them do. The go tests: our Go codebase got significantly larger. We wrote a lot more tests, they got slower. That makes sense. The yarn test bounced all over the place.
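To make the "95th percentile per job" idea concrete, here is a minimal Python sketch. It assumes the timing data is available as (step name, duration in seconds) pairs; the function name, the nearest-rank percentile method, and the sample numbers are all illustrative assumptions, not data or code from the talk.

```python
from collections import defaultdict

def p95_by_step(records):
    """Return the 95th-percentile duration per step name (nearest-rank method)."""
    by_step = defaultdict(list)
    for name, seconds in records:
        by_step[name].append(seconds)
    result = {}
    for name, durations in by_step.items():
        durations.sort()
        # Integer form of ceil(0.95 * n): the 1-based nearest rank.
        rank = (95 * len(durations) + 99) // 100
        result[name] = durations[rank - 1]
    return result

# Hypothetical durations in seconds, invented for illustration only.
records = [("go test", s) for s in range(1, 21)]
records += [("yarn test", 50.0), ("yarn test", 90.0)]
print(p95_by_step(records))
```

Computing this per step, rather than once for the whole build, is what lets each step's trend be read on its own.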

I don't spend a lot of time in JavaScript. I don't really know what caused those. It would be interesting to dig in and find out more. I'll just wave my hands and say, "JavaScript." The interesting thing here is that the build wasn't actually doing what we thought it was doing. We have new insight based on instrumentation. We added instrumentation and it changed our mental model of what the build is doing. We were able to come up with a more accurate representation of what is going on in order to better choose what work we're going to do. The first time we chose to do nothing. This time, we need to do something because the build really is getting longer. Having that level of detail allowed us to make better choices about what to work on next.

In this case, we wanted to parallelize some of the steps and we switched to another build system. It was conveniently timed, we had a couple of reasons for wanting to switch build systems. When we did, we noticed something unusual. On the left, these are the same steps we were looking at before, that was in the previous build system, and they're remarkably consistent. On the right, they are very noisy.

We looked at a little bit more detail about the implementation of these build systems, it turns out one was running in VMs and the other was running in containers. That was part of what made parallelization so easy is that each job was very well self-contained, you could run them in parallel very easily, but containers have larger co-tenancy issues. They also start up faster. There are a bunch of attributes about this graph that we didn't necessarily expect. When we started doing this work, because we had this instrumentation available, it was able to uncover some previously unknown attributes of the work that we were doing. That's super interesting.

Fast forward another five months. We have a nice group of data here, and that variability was very widely evident. This is the same graph we were looking at before, yarn. The 95th percentile across the top didn't really get faster. On the bottom, we see the distribution of time it took for that job to run. We can see that on the left side, when we were running in VMs, it was pretty tightly constrained; it was a consistent runtime. The right side has much wider variance, and is actually a little faster overall. It wasn't immediately clear that the switch to containers and the different way of running these builds succeeded in its job, which was to make the build faster.

When you look at all of the commands visually at the same time, this is a stacked graph of each of the steps. In the top graph, we see there's a little bit of time savings. In the bottom one, we see the distribution is actually about the same. When you graph the total build time, it becomes very visible what happened. Because of that parallelization, because we could run jobs concurrently, some of the longer jobs that we were noticing before in that waterfall graph no longer impact the total build time, which is great. That was our goal.

Looking at an individual build again, a number of these steps that have continued to get longer, no longer influence the total runtime of the build. The go tests, that's this bar up here, it is the longest job. The dependency goes from JavaScript dependencies to building the JavaScript to artifacts creation to deploy. Those two combined are longer. We succeeded. The build times dropped by about 40%, and we were able to verify, using the instrumentation, that our goal for this change was, in fact, achieved.

It still lets us guide our work going forward, too. Someone will come along and say, "I want to make the build go faster again," because it's going to continue to grow, that much is for sure. They come along and say, "We should optimize the go tests. We could run with more concurrency in such and such." We can look at this graph and say, "No, you shouldn't. That's wasted work." It's not influencing the critical path of this build's waterfall. The go tests can take almost twice as long as they do now, and the build would still run at exactly the same pace. The instrumentation has allowed us to understand better what work will be useful and what work we can put off till later.
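The critical-path argument can be checked with a few lines of Python. This is only a sketch under stated assumptions: the step names, durations, and dependency edges below are hypothetical and merely mirror the shape described in the talk (Go tests running alongside a JavaScript chain that ends in a deploy).

```python
# Hypothetical step durations (seconds) and dependencies; not figures from the talk.
DURATIONS = {"go_test": 120, "js_deps": 60, "js_build": 90, "artifacts": 40, "deploy": 30}
DEPENDS_ON = {"js_build": ["js_deps"], "artifacts": ["js_build"], "deploy": ["artifacts"]}

def total_build_time(durations, depends_on):
    """Wall-clock time when every step starts as soon as its dependencies finish."""
    finished = {}

    def finish(step):
        if step not in finished:
            start = max((finish(dep) for dep in depends_on.get(step, [])), default=0)
            finished[step] = start + durations[step]
        return finished[step]

    return max(finish(step) for step in durations)

baseline = total_build_time(DURATIONS, DEPENDS_ON)
# Making the Go tests much slower does not move the total...
slower_go = total_build_time({**DURATIONS, "go_test": 216}, DEPENDS_ON)
# ...while speeding up a step on the critical path does.
faster_js = total_build_time({**DURATIONS, "js_build": 60}, DEPENDS_ON)
print(baseline, slower_go, faster_js)
```

With these made-up numbers, only changes to the JavaScript chain affect the total, which is the reasoning behind calling work on the Go tests wasted.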

Adding instrumentation to the build changed our model. It changed our understanding of how all of the pieces in the build relate to each other. It allowed us to choose which work will be most effective at the time. It proved the results: we switched build systems with the hope that it would make the build faster. We uncovered some aspects of it that we didn't know before but, overall, got confirmation that, in fact, in aggregate, our builds are faster and our developers are happier. The variance is wider, which is ok sometimes. Again, it can feel very long when you're waiting for the build. That's a trade-off that we got in exchange for running things in parallel.

What we did here was take a tool that we normally use for understanding our production applications, distributed tracing, and apply it to what's mostly a black-box operation, running a build. By applying it there, we actually got a lot more insight into this process that was previously just something that happens, and we were able to get direct benefit from it as developers.

Not only that, but the developers that are building the application, who maybe haven't spent so much time running and building production infrastructure and code, now have a familiar tool that they can use to look at this part of the life cycle that they haven't really paid attention to. It lowers the barrier of entry so that while maybe they're used to working in their Go code base, they're now ok going over to the YAML config file that is running this build and tweaking it and then being able to understand the effects of the changes that they've made.

How Does It Work?

Those are the pretty pictures. That's the reason why we are doing this thing and what we get out of it. Now, I want to tell you how it happened. Back at the beginning of that story, I said Honeycomb believes in observability and we wanted to instrument these things. We instrumented it and then went on into the pretty pictures. That's the whole section that now expands into this part two.

In order to instrument it effectively, we needed to define the problem a little bit better. I've been using the word build system; the software supply chain track definitely has that as part of its vocabulary, but the last talk gave a fantastic definition of what CICD means. I'm going to ignore most of it because, for the purpose of this talk, I'm coming at it from the perspective of the code being built rather than somebody developing a build system.

In this talk, by build system, or CI, or CICD, what I mean is a system that will really do two things, test code and build it. It usually happens on some trigger or timer. It does a bunch of stuff in order to make an artifact. They generally pass or fail, and they do this by running a bunch of shell commands. I know there's a lot more to build systems than that. If any of you run build systems, I understand they're super complex, but from the perspective of the code being built, this lets us set up an abstraction that we can use to make the instrumentation job easier.

The core duties, from the perspective of this code being built, is that it sets up an environment that is isolated or repeatable. In that environment, you run a bunch of commands. You stop when one fails. Then in some way you record the result: the build passed, the build failed. What we were looking at earlier were some of the commands within that build, and we're going to get into how we got that data.

When looking at each one of these commands, as well as the entire build, there are a couple of things we want the instrumentation to emit, in order for us to draw a representation of that build and collect that data over time. We want to know how long it took. We want to know what commands were run and how long each one took. We want to know whether the build passed. If it did not pass, we want to know where in the sequence of things it failed. Let's look at the same list from the perspective of the waterfall graph to highlight where we're going.

How long did the build take? That's up at the top in red. What were all of the individual commands? That's this list along the left. How long each one took? That's the numbers next to the bars. Did they succeed or fail? That's the column in the middle. We've connected what we want from the instrumentation and how we're going to represent it. Now, we just need to collect it.

To anchor this in an actual build, I want to use a sample configuration. This is a configuration for a build in Travis CI, a SaaS-build provider. For those of you who haven't read Travis configs, what this is saying is, "I'm going to build a Go project. I'm going to produce some artifacts. That's the paths where I'm going to find those artifacts. There's a script section that says, 'I am going to run some go tests.' If those tests passed, I run the after success section and it will create the binary that's then picked up by the thing up top that publishes the artifact." This is what we're working with, very simple config. It runs tests, it builds a binary, publishes it.

Those are regular shell commands: go test, go install. We're going to take advantage of the fact that this is run inside a shell, and we're going to write a shell wrapper. It will take a name for the people looking at it to see what's going on, and it will run all of the rest of the arguments in a sub-shell, and then report back how long it took. It will put that data somewhere. We're just going to send it to standard out. Because I want to start simple, I want to start with a proof of concept to show that, in fact, this idea of adding instrumentation to a build system actually is not super complex.

Forgive me, I'm making you read Bash. We'll get through it quickly, I promise. We are taking the first argument as a name. The point of the instrumentation is not just recording the results, but making them friendly for people to understand. We don't want just the command line, we want a name that people will look at and recognize and understand. It assigns the first argument to name and records when it started. Using eval, it runs all of the rest of the arguments as the command it's actually supposed to run, records what happened, and then sends it to standard out.

20 lines of Bash, we have instrumentation. Let's see it run. On a shell, we run this Bash script, give it a name, fancysleep, and we, in fact, sleep for four seconds and verify that, yes, the duration is four. Our shell wrapper works. Just to make sure that we're passing the exit code along, we try one with the command false. It does nothing but fail immediately. In fact, the shell agrees. It's important to pass the exit code along. That's a critical part of this because that's how the build system understands whether your tests passed or failed – they use the exit code from the command that ran.
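The talk's wrapper was written in Bash and is not reproduced in this transcript. As a rough equivalent, the following Python sketch shows the same idea: take a friendly name, run the command in a sub-shell, print a timing record to standard out, and hand the exit code back. The function name and record field names are assumptions made for illustration, and the sketch assumes a POSIX shell is available.

```python
import json
import subprocess
import time

def timed_run(name, command):
    """Run `command` in a sub-shell, print a timing record, return the exit code."""
    start = time.monotonic()
    completed = subprocess.run(command, shell=True)
    record = {
        "name": name,  # human-friendly label, not the raw command line
        "duration_s": round(time.monotonic() - start, 3),
        "status": "success" if completed.returncode == 0 else "failure",
    }
    print(json.dumps(record))
    # The caller must exit with this code so the build system still sees pass/fail.
    return completed.returncode
```

A caller would typically finish with something like `sys.exit(timed_run("fancysleep", "sleep 4"))`, so that a failing command still fails the build.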

We got it to send to standard out, which is fine, but we want to aggregate this data somewhere off-host. We don't want to look at the actual output of the build config. Instead of using echo, we're going to use curl and we're just going to send it off to some third party. I'm sending it to Honeycomb because that's what we have available. It can go to any sink that will accept structured data. It's just sending a set of three name=value pairs here.
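Swapping echo for an HTTP call might look roughly like the Python sketch below. The endpoint URL is a placeholder, the JSON body shape is an assumption rather than any particular vendor's API, and swallowing network errors is a design choice of this sketch (so that a telemetry outage cannot fail a build), not something stated in the talk.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute whatever sink accepts structured events.
SINK_URL = "https://telemetry.example.com/events"

def build_request(record, url=SINK_URL):
    """Wrap one timing record as a JSON POST request."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

def send(record, url=SINK_URL, timeout=5):
    """Best-effort delivery: report success or failure without raising."""
    try:
        with urllib.request.urlopen(build_request(record, url), timeout=timeout):
            return True
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        return False

# Build (but do not send) a request for the three name=value pairs.
request = build_request({"name": "fancysleep", "duration_s": 4, "status": "success"})
```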

Let's put it in the config and then we'll basically be done with an instrumented build config. It does get a little bit more verbose, the lines get a little bit longer. Thankfully, we're not really limited to 80 characters anymore. We checked this tool into our source code repository and prepended it to each of these commands. At this point, we have a build that, when it runs, will record how long each of these commands took, along with the bits that we wanted to represent, and send them off to someplace for aggregation and later display. That wasn't too hard.

We do want to get a little bit fancier. That was a prototype. It didn't really collect a lot of useful information beyond just the timing. There are a couple of things we need to do to make it graduate from a prototype to something that really will be useful. Our Bash script is growing longer than 20 lines, so it's definitely time for a new language. More importantly, we want to take each of these commands that were run and link them together in some way so that they represent more than just individual runs. We are trying to represent a build. It's not just a sequence of commands, it's a sequence of commands that are run in order and they are collected. If I'm running two builds at the same time, I can't just mash the runs together. It wouldn't make any sense. We need to link them all together.

As part of that, we want to improve the data model a little bit. We started out saying build system runs commands. That's it. As we saw from the Travis script, there are a couple of different phases. It had the script section and the after success section. That's useful. That's useful information because it takes these atomic units of running commands and bundles them into chunks that are more consistent over time. Finally, there's a lot more context present that would be really useful to gather. Builds in the life cycle are kicked off often by pull requests. We should be able to ask, "I have this pull request, show me the build and link that together somehow." There's a whole lot of context that goes beyond, just let's look at name and duration.

We're going to talk through each of those last three things for a moment. First, linking the commands together. This one sounds complicated because now we're switching from individual commands to this idea of using a trace model. Thankfully, build systems generally give you a unique identifier representing that build just as part of the environment in which they run. Travis CI exposes it as an environment variable, TRAVIS_BUILD_ID. We're going to use that and allow the person doing the instrumentation to say, "Ok, here's your identifier that will tie these commands together."

The first argument there is going to be the build ID. We add this to every invocation of the build events. Build events is what we call the version that's not Bash. Every invocation of build events gets this build ID, now it knows, "Ok, these commands were related, they came together, and I'll keep them together when we start looking at traces."
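A small Python sketch of why the shared identifier matters: with a trace ID on every event, commands from two builds running at the same time can be separated again. `TRAVIS_BUILD_ID` is the variable Travis CI sets; the event fields, the `local-run` fallback, and the sample events are assumptions made for illustration.

```python
import os
from collections import defaultdict

def trace_id_from_env(env=None):
    """Use the CI-provided build identifier as the trace ID."""
    env = os.environ if env is None else env
    # The fallback value is a hypothetical default for runs outside CI.
    return env.get("TRAVIS_BUILD_ID", "local-run")

def make_event(trace_id, name, duration_s):
    return {"trace_id": trace_id, "name": name, "duration_s": duration_s}

def group_by_trace(events):
    """Reassemble interleaved events into one command list per build."""
    traces = defaultdict(list)
    for event in events:
        traces[event["trace_id"]].append(event["name"])
    return dict(traces)

# Two hypothetical builds whose events arrive interleaved.
events = [
    make_event("1001", "go test", 110),
    make_event("1002", "go test", 95),
    make_event("1001", "go install", 20),
]
print(group_by_trace(events))
```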

Next up, we want to improve the data model a little bit. We want to get beyond individual commands and to groups of commands, steps. I was very happy that in the previous talk on Tekton, Tekton, in fact, actually has the same units of work as represented in these other build systems, I hadn't come across it before. That was awesome. The idea is, you have a bunch of commands, you group them into tasks, or steps, or jobs. CircleCI talks about jobs. The job will be a slightly higher level of abstraction. It will be, "I'm going to test all of this code," or "I'm going to build and deploy an artifact," or something like that. Finally, the build is the whole thing put together.

In this example, we want to group those two commands together, not just with the build ID, but with something else saying, "We are part of this same larger task within the build." We're going to create our own identifier this time. In the Travis CI world, each phase is executed exactly once, so we use just the name of the phase. In CircleCI, you can use the job ID. There is generally an identifier available representing, "Here is this group of commands," and that's what we're going to use for the identifier to tie those two commands together.

Finally, we're going to time the whole block. We're going to use exactly the same trick we had in that Bash script at the beginning, we're going to record the time at the beginning and then use it at the end. Now we have a duration.
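The grouping and the "record the start, use it at the end" trick can be sketched in Python as follows. The class, its method names, and the span fields are hypothetical; the sketch only illustrates how each command span points at its step, and the step span points at the build.

```python
import time

class StepTimer:
    """Times a named group of commands: one span per command, plus one for the step."""

    def __init__(self, build_id, step_id, clock=time.time):
        self.build_id = build_id
        self.step_id = step_id
        self.clock = clock
        self.spans = []
        self.step_start = clock()  # recorded once, at the beginning of the step

    def _span(self, name, parent, start):
        return {
            "trace_id": self.build_id,
            "parent": parent,
            "name": name,
            "duration_s": self.clock() - start,
        }

    def run(self, name, func):
        """Run one command (any callable) and record it as a child of the step."""
        start = self.clock()
        func()
        self.spans.append(self._span(name, parent=self.step_id, start=start))

    def finish(self):
        """Close the step using the start time saved at the beginning."""
        self.spans.append(self._span(self.step_id, parent=self.build_id, start=self.step_start))
        return self.spans
```

The `clock` parameter exists only so the timing can be exercised deterministically; in a real build the default wall clock would be used.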

We've accomplished those two goals, improving the data model, by grouping these commands together, and our command line has gotten quite a bit longer. This is YAML, it's not something you're typing out, that's ok. There are ways you can make it shorter again but that's a syntactic optimization and I'll leave that for after the talk if you want to talk about it more.

This is the visual representation of the same grouping. In the first traces, we just had a linear list of all of the commands that are being run. Now we have this grouping called JavaScript dependencies, in it are three commands. In the waterfall diagram, originally, we had just the purple bar that represented the whole build and then all the jobs below it. Now we have a repetition of that same structure where the greenish bar represents the JavaScript dependencies task, even though it's got a couple of smaller tasks within it.

Onto additional context. There's some context that you can think of, "We're in a build system, it obviously makes sense for us to collect these things." What is the name of the branch that we're building? Assuming we're using some source code control like Git. If it came from a PR, what is the number of that PR? If we have multiple build systems that we're using, because we have very large projects, which one are we in? Who is the person that made the commit that triggered the build? All of these things are generally available either in the environment of the build system or via the source code control repository that you are working in to do the build. You can ask Git for things, or you can just collect them directly from the environment. Those are easy. They come along for the ride.
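Collecting this context from the environment could look something like the Python sketch below. The environment variable names are ones Travis CI and CircleCI set; the field names on the left and the function itself are illustrative assumptions, not the talk's implementation.

```python
import os

# CI-provided variables mapped to common field names (field names are illustrative).
FIELD_SOURCES = {
    "branch": ["TRAVIS_BRANCH", "CIRCLE_BRANCH"],
    "pr_number": ["TRAVIS_PULL_REQUEST", "CIRCLE_PR_NUMBER"],
    "build_id": ["TRAVIS_BUILD_ID", "CIRCLE_BUILD_NUM"],
}

def collect_context(env=None):
    """Return whichever context fields the current build system provides."""
    env = os.environ if env is None else env
    context = {}
    for field, variables in FIELD_SOURCES.items():
        for variable in variables:
            value = env.get(variable)
            # Travis sets TRAVIS_PULL_REQUEST to "false" when the build is not a PR.
            if value and value != "false":
                context[field] = value
                break
    return context
```

Because the lookup is table-driven, supporting another build system is a matter of adding its variable names to the lists.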

The most useful ones that I've seen so far are branch name, because it helps you find the specific thing you're working on, and PR number, because you can pull in links directly from pull requests. Depending on the build system we're in, they're named different things, but all of that just comes along for the ride. Far more interesting are custom fields that the developer, the person architecting this build, wants to add in order to better understand their particular jobs. In our case, artifact size is very interesting. The size of the things that you ship to the browsers has a direct impact on page load time. How often you rev those webpacked bits directly correlates to your user experience on your site. Tracking artifact size is hugely useful when you're doing builds over time.

Build depth is a really neat idea that I'm not going to talk about now because I don't fully understand it. It's complicated, but it lets you talk about which commits landed in a different data model of the Git tree. The point here is that we are opening the door for the people using the build system to add their own data because it's relevant to their business. There's one great story about somebody who instrumented the build and included, in the output of the test command, more detail about which test failed, and then could generate reports on test flakiness. That directly guided work towards making their builds more reliable and more stable. By making sure to include a generic API through which the developer using the system can add content, you open the door for unlimited new things to come out of it.

The mechanics of adding additional context are quite simple. Whatever job you're running, whether it's part of the build or within a command that's run, drops a file in the local file system with some name-value pairs and then pushes that filename into the environment and then they get slurped up and go along for the ride. We can make some pretty graphs. The one in the background is the asset size for our builds over the past four months, I think. It's relatively consistent. We haven't had to fight that battle yet. The lower graph shows the number of builds on a per-repository and committer basis for all of the open-source repositories under the Honeycomb organization in GitHub.
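The file-plus-environment-variable mechanism can be sketched in Python like this. The environment variable name, the `key=value` line format, and the helper names are assumptions chosen for illustration; the talk does not spell out the exact format.

```python
import os
import tempfile

EXTRA_FIELDS_ENV = "EXTRA_FIELDS_FILE"  # hypothetical variable name

def add_fields(path, **fields):
    """Called by any build step: append name=value pairs, one per line."""
    with open(path, "a", encoding="utf-8") as handle:
        for key, value in fields.items():
            handle.write(f"{key}={value}\n")

def load_fields(env):
    """Called by the instrumentation: slurp up whatever the steps dropped."""
    path = env.get(EXTRA_FIELDS_ENV)
    if not path or not os.path.exists(path):
        return {}
    fields = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            key, sep, value = line.rstrip("\n").partition("=")
            if sep:
                fields[key] = value
    return fields

def demo():
    """Record the size of a stand-in artifact and read it back."""
    with tempfile.TemporaryDirectory() as workdir:
        artifact = os.path.join(workdir, "bundle.js")
        with open(artifact, "wb") as handle:
            handle.write(b"x" * 1024)
        fields_file = os.path.join(workdir, "extra_fields")
        add_fields(fields_file, artifact_size_bytes=os.path.getsize(artifact))
        return load_fields({EXTRA_FIELDS_ENV: fields_file})
```

Using a plain file keeps the producer and consumer decoupled: any step in any language can append a line without knowing how the data is shipped.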

That was just neat. I don't actually have a particular reason to pull that one up. It's just it's neat to see there's a steady stream of commits going to all these different repositories and you can ask your questions about who's doing which work and see which repositories are hot? What's getting a lot of activity in different times?

Everything so far has been from the perspective of the job being built. We have put that constraint on our service because it gives us interoperability with all of the different build systems. If we choose to go beyond that and say, "No, we know we're in a specific build system," in this example CircleCI, we can reach out to their API and we can get additional data. That's additional context and has some real benefits over the information that's available from within the environment.

For CircleCI, in particular, there's a method of calling out to the API and asking it, "This job that I'm in right now, this one that's running, when did it start?" Their API is sufficiently up-to-date that, in fact, you can get the start time for the currently running job. That's super interesting because it lets you now get a more accurate time when you started. When you record the start time from within the build config, what you're really recording is the job started, the container spun up, the dependencies were loaded, and then you get to run your stuff.

By being able to talk to the API, now we're getting information not about when we were able to start running, but when the build actually started. It more closely maps to our developers' experience of when they hit commit or merge, and the build starts running. The data we get is more accurate. That's a big benefit, it gives us more confidence in that code. It's hard to maintain because the API for Circle is radically different from the API for any of the others. This is when all of you come in. We use CircleCI at Honeycomb, and so we've built that integration. It gives us some better data for our builds.

Thanks to community contributions, the build events binary works and knows about environment variables for all of the build systems across the bottom there. It doesn't talk to any other APIs, but if any of you are interested in trying it out and use one of those APIs, we welcome contributions and we'd love to see it be able to interact with APIs for a couple other targets. The QR code here is the GitHub repository for build events.

To recap, we wanted to understand how the build system works from the perspective of being cross-platform and being able to run within the build system itself without having control of the build system. We wanted to get good quality instrumentation out of it. By recognizing that we run shell commands and putting in a little wrapper, we could get that instrumentation and use the characteristics of the build system, the environment and local file system, for inter-process communication. We get extra context from both the environment and from APIs, and we can collect that data and send it off to a third party for visualization.

What's Next?

Where are we going next? Build systems are just one small little corner of the software supply chain. It's great to visualize them. We've gotten a lot of value from just that part. The DORA Report talks about the four key metrics for high-performing teams. One of those metrics is the time from commit to deploy, the time between when you're done with your task and when it's finally running in production. Building is a part of that, but it's just one part. I would really like to see the same kind of visualization for more parts of the software supply chain.

I'm not convinced that tracing is necessarily the right representation, but let's talk through a little bit of this commit to deploy lifecycle to try and flesh out what attributes would be a useful way of modeling it, in the same way that we started by talking about the build system and what the core requirements were for instrumentation.

There's a whole lot of subtlety to the whole life cycle and I don't pretend to cover it all. In the most general sense, if you think of yourself writing code, it often winds up in several commits that then open a pull request. Then you get some reviews and feedback and you make changes, and submit those changes and review the pull request. Eventually, it gets stamped. Each of those changes, every time you add to this pull request, you're triggering more builds.

Finally, when it does get stamped and merged, it's going to get deployed somewhere. Maybe every merge goes straight to production, maybe not. There might be testing or staging environments to go through first. Then when it does finally get out to production, there are different ways of doing that. Each PR could result in a deploy to production, or perhaps they're on a timer, so every hour, or day, or week or whatever interval, it collects up all of the changes and pushes those out. The code might be reverted. There are all of these bits about this cycle, this life cycle, that aren't particularly linear, there are some loops.

It's an interesting question to think of, what is the appropriate model for that collection of changes? From the perspective of what are the questions that we might ask, we want to be able to represent an individual lifecycle, an individual commit to deploy life cycle well. Looking at any individual commit, you want to know, what was its path forward? When did it start? What cycles did it go through? When did it finally wind up where?

You want to go the other direction too: what just got deployed? It had a bunch of PRs, it had a bunch of commits. How can you start from the deploy and go backwards? It needs to be able to represent these cycles and delays in a meaningful way. The delays, in particular, are interesting because there are many phases in this life cycle that are machine-timescale. The deploy might take a couple of minutes, but it's a short burst of concentrated activity. There are many phases that are more human timescale. You might commit code at 4 p.m. and open a PR and I might get to it the next day at 10. There's an enormous time gap between those two things, and then I might spend some time on it. There's this mix of timescales that makes it difficult to use any kind of strict temporal representation.

Let's assume we got to the point where we can represent one run; we also want to represent many runs in aggregate. Looking at the build as a single trace, we gained significant insight. It was super interesting, we could see what each phase was doing. Looking at them in aggregate, we gained different insight. Both are very valuable. In the aggregate form, we want to be able to understand what is the overall lead time, the time from commit to deploy, and how does that change over time?

There are a couple of other timings that are super interesting. The time from merge to deploy might give you more insight into the mechanics of your deployment infrastructure, separate from the total life cycle. I know my team talks a lot about the timings in PR review delay. Someone will pipe up in the morning, "I owe somebody five PRs." Being able to track that in a good way, being able to measure it lets you improve it, lets you understand what are the appropriate ways that you want to improve and how can you verify that you've improved?
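As a rough illustration of those three timings, here is a minimal sketch that derives lead time, PR review delay, and merge-to-deploy time from per-change timestamps and takes the median across changes. The dictionary keys and the sample timestamps are made up for the example; where the timestamps come from (source control, the build system, the deploy tooling) is left open.

```python
from datetime import datetime
from statistics import median


def phase_durations(ts):
    """Seconds spent in each span of one change's life cycle.

    `ts` is a dict of datetimes with hypothetical keys:
    commit, pr_opened, first_review, merged, deployed.
    """
    return {
        "lead_time": (ts["deployed"] - ts["commit"]).total_seconds(),
        "review_delay": (ts["first_review"] - ts["pr_opened"]).total_seconds(),
        "merge_to_deploy": (ts["deployed"] - ts["merged"]).total_seconds(),
    }


def median_durations(changes):
    """Median of each timing across many changes, in seconds."""
    per_change = [phase_durations(ts) for ts in changes]
    return {name: median(d[name] for d in per_change) for name in per_change[0]}


# Made-up sample data: three changes with very different review delays.
changes = [
    {"commit": datetime(2020, 1, 6, 16, 0), "pr_opened": datetime(2020, 1, 6, 16, 5),
     "first_review": datetime(2020, 1, 7, 10, 0), "merged": datetime(2020, 1, 7, 11, 0),
     "deployed": datetime(2020, 1, 7, 11, 10)},
    {"commit": datetime(2020, 1, 8, 9, 0), "pr_opened": datetime(2020, 1, 8, 9, 10),
     "first_review": datetime(2020, 1, 8, 10, 10), "merged": datetime(2020, 1, 8, 10, 40),
     "deployed": datetime(2020, 1, 8, 10, 45)},
    {"commit": datetime(2020, 1, 9, 13, 0), "pr_opened": datetime(2020, 1, 9, 13, 30),
     "first_review": datetime(2020, 1, 9, 15, 30), "merged": datetime(2020, 1, 9, 16, 0),
     "deployed": datetime(2020, 1, 9, 16, 20)},
]

summary = median_durations(changes)
print(summary)
```

The median is used here only because one overnight review can dominate an average; percentiles or a full distribution would serve the same purpose.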

Finally, in order to do all of this, it has to fit in the entirety of the toolchain. If we move towards systems like the talk on Tekton, that makes it a little bit easier because there's a more concentrated API. Already, we're working with source-code repositories, a variety of build systems, a variety of deployment infrastructure, some of it in-house, some of it more vendored, but there are a lot of APIs to work with there. That's going to be a tricky part to accomplish from a purely organizational perspective.

I got the build system done, who wants to do the rest of the lifecycle? I mean, literally, I really hope that next year, at this talk, someone else is up here on stage saying, "Here's a beautiful visualization of this part of our workflow. Here are ways in which we applied this measurement and this instrumentation in order to make our developers' lives better. Here's the way that you can do it too." I don't know that I'll be able to do that and I really hope that somebody here can pick up the torch and carry that part along.

It's important to make sure that every section of this work that we do includes the appropriate hooks in order to do it. The build system was neat because it runs Bash commands, we can use shell wrappers, they're off-the-shelf tools. We were able to apply the visualizations from Honeycomb and pull what was built as something for production and apply it to this other bit. I know you all have a lot of experience with different ways of instrumentation, different representations of that, and I hope there's something out there that will make this work really well.
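The wrapper idea generalizes beyond shell. As a hedged sketch, and not the tooling used in the talk, the Python function below runs one build step as a subprocess and returns a single event describing it. The event field names and the `build_id` value are assumptions for illustration, and shipping the event to any particular backend is deliberately left out.

```python
import json
import subprocess
import sys
import time


def run_step(name, argv, build_id):
    """Run one build step and describe it as a single timing event.

    Field names are hypothetical; a real setup would use whatever
    schema its observability backend expects.
    """
    started_at = time.time()       # wall-clock start, for placing the step on a timeline
    began = time.monotonic()       # monotonic clock, for an accurate duration
    result = subprocess.run(argv)  # output passes through to the build log untouched
    duration_ms = (time.monotonic() - began) * 1000.0
    return {
        "build_id": build_id,      # ties every step of one build together
        "step": name,
        "start_time": started_at,
        "duration_ms": round(duration_ms, 1),
        "exit_code": result.returncode,
    }


if __name__ == "__main__":
    # Stand-in for a real build command such as a test or compile step.
    event = run_step("unit-tests", [sys.executable, "-c", "print('ok')"], "local-1")
    print(json.dumps(event))
```

Because the wrapper only needs to sit in front of a command, it can live entirely in the build config, which is the property that made the shell-wrapper approach workable on hosted build systems.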


Here's the recap. Instrumentation and visualization let you guide your work, do better work, and prove that your work is effective. Build systems were an easy place to apply this and get very actionable insight. More generally, taking the tools that we have developed to understand the complex production environments that we've built, enormous microservice architectures and everything else, and applying them to this aspect of the developer lifecycle, the developer workflow, lets us reuse a lot of those things in a way that might not be commercially viable. Honeycomb could never survive if its customers were all just instrumenting build systems because it makes just a tiny bit of data. It's super easy. By using those same tools, we can take advantage of all of that work.

Finally, taking these ideas from the larger cultural movement of observability and applying them outside of the areas in which they're most commonly discussed (production environments, testing in production environments, and understanding the impacts on the performance of applications running in this highly contentious area), and applying them to the software supply chain, will take us to a better place and let us continue working our jobs and enjoying them.

Questions and Answers

Participant 1: Did you figure out why the container builds were so variable?

Hartshorne: I had some chats with some folks at CircleCI and the easiest answer is co-tenancy. There are multiple containers running on single servers. They do this because it's an effective way of managing resources. This is part of being cloud-native and it's part of the design of the system, so long as it fits, it's fine.

Participant 2: You mentioned briefly about doing unit test results. Does it handle having extremely large sets of those unit test results? Or how would you recommend splitting that up or do you have any pointers on that?

Hartshorne: I didn't run that project. This was actually a Honeycomb customer that did this and was telling me the story. I'm not sure how they pulled it off. In general, the ability to get good results out of the instrumentation you send will depend upon the system to which you're sending it. We used Honeycomb, which is built around very wide events with many different fields, so it tends to work out OK. I think they were piping the information to maybe an ELK stack and using that for analysis. I'm afraid I'm a little fuzzy on the details of that particular one.

Participant 3: Any thoughts about how to enforce the culture of instrumentation build system in a large organization where there are different speeds of software [inaudible 00:44:39]? Any thoughts about that other than bringing in Honeycomb [inaudible 00:44:47]?

Hartshorne: I hope I'm getting it right: Do I have thoughts on how to encourage adoption of the culture of instrumentation and observability of the build system in larger organizations that have many software suites and applications and maybe even many build systems? I'm afraid I really don't. The idea of observability – and this is as applicable to production infrastructure as it is to build systems – if you identify the business value of the instrumentation and the effects of adding that instrumentation, you can convince not just the practitioners who are the ones who are going to get direct value from it, but also the business when you can connect it back to user experience or money spent or anything like that.

I'm no expert on how to create cultural change within an organization, but the actual effects of having this data have been just blatantly obvious and very useful, not just to the folks that are in it every day. Other than do it and show that it's useful and go from there, I hope other folks will help you work that problem.

Participant 4: What are the learnings of having the instrumentation code directly in the build config versus having it as passive agents or something like that? What are the learnings so far?

Hartshorne: Honeycomb believes in SaaS products and we use third parties to build all of our code. We don't have access to the build system. We couldn't run an agent on Travis CI or CircleCI. The real reason we did it this way initially is so that we could run it anywhere. By keeping it within the build config, we've isolated ourselves from dependencies on the mechanics of the build system itself.

For companies that either host their own build system or something like that, it's certainly attractive. You can get different types of data when you interact with the build system as well. I think the true answer is go both ways. You can do it easily as somebody without any control other than the ability to influence your build config, and then as you have gotten it more in place and can show the benefit, move on and get better quality data from the build system itself.




Recorded at:

Feb 03, 2020