

Armon Dadgar on HashiCorp Research, the Evolution of Infrastructure Tooling, and Standardisation

Aug 02, 2019

Academic computer science research potentially has a lot to offer the mainstream computing industry, but the two disciplines don’t always work well together. The HashiCorp team purposely embraced the research community when they began building their infrastructure tools, and they have now contributed both novel research and also the results of practical implementations of research algorithms that are being run at large scale.

On this podcast, we’re talking to Armon Dadgar, co-founder and CTO of HashiCorp. Alongside Mitchell Hashimoto, Armon founded HashiCorp over six years ago, and the company has gone from strength to strength, with their open source infrastructure product suite now consisting of Consul, Nomad, Vault and Terraform.

We discuss the formation of the HashiCorp research division and explore some of the computer science research underpinning Consul and Nomad. We also cover the challenges of supporting teams when they are looking to embrace new modes of working with dynamic infrastructure, and Armon introduces the new learn.hashicorp.com educational website and accompanying community and support forums.

Key Takeaways

There is a lot of fundamental computer science research that underpins the HashiCorp infrastructure workflow and configuration tooling. This helps to ensure that these mission-critical tools perform as expected, and enables sound reasoning about scaling these technologies.
The HashiCorp founders recognised the value of creating an industrial research-focused department within the company even when there were only 30 staff.
The Consul service mesh and distributed key value store leverages consensus and gossip algorithms from computer science research, Raft and SWIM, respectively. The HashiCorp team contributed a novel research-based improvement to SWIM -- Lifeguard: SWIM-ing with Situational Awareness -- that was presented at the DSN academic conference.
Initially HashiCorp produced a new tool every 6-12 months, focusing on filling gaps within the infrastructure workflow tooling market. Now the focus is on refining the operator/user experience of the existing tools, creating more integrations with other platforms and tooling, and facilitating engineering teams adopting these tools, via the creation of educational resources and community forums.
Standardisation within computing technology can offer many benefits, especially where interoperability is required or technology switching costs are high. Care must be taken to ensure the correct interfaces are created, and that the time is right to create appropriate abstractions.
The HashiCorp team are focusing on "marching up the stack", with the goal that a lot of the underlying "plumbing" should be hidden from, or easily configurable by, application developers. This will allow developers to focus on adding value related to their business or organisation, rather than getting stuck with managing infrastructure.


Show Notes

What was the motivation for your QConNY 2019 talk on research in the real world? -

03:15 It's an interesting question - some people who follow HashiCorp closely will know we have HashiCorp Research.
03:25 It's unusual for a start-up to have a research department.
03:30 We first started HashiCorp Research when we were 30 people; so we had a really tiny crew.
03:40 It's a question we often get: why do we have a research group, what's their goal, how do they integrate with engineering?
03:45 The other thing people notice about HashiCorp is how much fundamental computer science is at the base of it.
04:05 When Cindy Sridharan asked us to come and give a talk about it, it seemed like an ideal thing to do.

What's your research team's proudest achievement? -

04:35 The two that have the most research baked in are Consul and Nomad.
04:40 We've done a good job in making them invisible.
04:45 When you look at what's under the hood in Consul, there's a consensus algorithm, Raft, and a bunch of extensions to how we do data consistency.
04:55 It has a distributed Gossip layer as well, and that has a whole bunch of extensions from different pieces of the literature.
05:05 On top of that we have a network co-ordinate system: every node has a GPS-like identifier which is a six-dimensional co-ordinate, which can be used to calculate the distance between any two nodes.
05:25 It is usually accurate to within 10%, so you can say two nodes are 5ms apart on the network.
05:30 This allows us to route to the nearest node on the network; the nearest MySQL or memcache based on a network basis.
05:40 The list keeps going for things we've baked into Consul.
05:45 The user experience is that it just works; you don't think about all the technology under the hood.
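
The nearest-node routing described above can be illustrated with a minimal Python sketch. It assumes each node's co-ordinate is a plain six-element tuple and that the estimated round-trip time is the Euclidean distance between two co-ordinates; Consul's actual co-ordinate model may include further terms, and the names `estimated_rtt` and `nearest` are hypothetical.

```python
import math
from typing import Dict, Sequence

# Assumption: a co-ordinate is six floats, expressed in seconds.
Coordinate = Sequence[float]


def estimated_rtt(a: Coordinate, b: Coordinate) -> float:
    """Estimate the network distance between two nodes as the
    Euclidean distance between their co-ordinates."""
    if len(a) != len(b):
        raise ValueError("co-ordinates must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(origin: Coordinate, candidates: Dict[str, Coordinate]) -> str:
    """Return the name of the candidate with the smallest estimated RTT."""
    return min(candidates, key=lambda name: estimated_rtt(origin, candidates[name]))


if __name__ == "__main__":
    me = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    databases = {
        "mysql-a": (0.003, 0.004, 0.0, 0.0, 0.0, 0.0),
        "mysql-b": (0.001, 0.0, 0.0, 0.0, 0.0, 0.0),
    }
    print(nearest(me, databases))
```

No probe is sent at query time in this sketch: the caller only compares co-ordinates it already holds, which is what makes "route to the nearest MySQL" cheap.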

Does HashiCorp publish research papers? -

06:05 Yes - we have our own novel extension to the Gossip protocol (called Lifeguard) that we published last year.
06:10 Twitch was a big user of Consul, and they have talked publicly about their usage: when they were being DDoS'd, they were impacted on the front end and on some of their back-end Consul services.
06:30 We found that there was a problem with Gossip, that it is fundamentally co-operative: if you tell me a node is dead, I believe it.
06:45 If you are unhealthy, and you're telling me that another node is unhealthy, you're "poisoning the well" of these other nodes.
06:50 The key question is how do you make the system resilient to the fact that nodes that are co-operating might themselves not be healthy or trustworthy.
06:55 We published this paper on Lifeguard (SWIM-ing with Situational Awareness) and it was fun to give back to the community; both Chef and Cassandra adopted it.
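
One way to picture "not believing a single unhealthy peer" is sketched below in Python. It is a toy illustration under stated assumptions, not a reproduction of the Lifeguard paper or of any HashiCorp code: a suspected node is only declared dead after a timeout that shrinks as more independent peers confirm the suspicion, and a node that keeps missing acknowledgements treats itself as possibly unhealthy and waits longer before accusing others. The function and class names, the logarithmic formula, and the default values are all hypothetical.

```python
import math


def suspicion_timeout(confirmations: int, expected: int,
                      min_timeout: float, max_timeout: float) -> float:
    """Seconds to wait before declaring a suspected node dead.

    With no independent confirmations the full max_timeout applies;
    the wait shrinks logarithmically toward min_timeout as more
    distinct peers confirm the suspicion.
    """
    if expected < 1:
        return min_timeout
    confirmations = max(0, min(confirmations, expected))
    frac = math.log(confirmations + 1) / math.log(expected + 1)
    return max_timeout - frac * (max_timeout - min_timeout)


class LocalHealth:
    """Tracks how unhealthy this node believes itself to be."""

    def __init__(self, max_score: int = 8) -> None:
        self.score = 0
        self.max_score = max_score

    def record_missed_ack(self) -> None:
        # A missed ack may be our own fault (overload, packet loss).
        self.score = min(self.score + 1, self.max_score)

    def record_success(self) -> None:
        self.score = max(self.score - 1, 0)

    def scaled(self, base_timeout: float) -> float:
        """Stretch a timeout in proportion to our own unhealthiness."""
        return base_timeout * (self.score + 1)


if __name__ == "__main__":
    health = LocalHealth()
    health.record_missed_ack()
    for seen in range(4):
        wait = suspicion_timeout(seen, expected=3, min_timeout=2.0, max_timeout=10.0)
        print(seen, round(health.scaled(wait), 2))
```

The design choice in this sketch is that a single accusation only starts a long timer, so a node that is itself degraded cannot immediately "poison the well" for the rest of the cluster.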

Did you present that at an academic conference? -

07:30 We put the pre-publication on arXiv, which is becoming the norm now, but we did present at DSN (2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops) which is a conference with an industry track.
07:50 The basis of the underlying paper is SWIM, and the original SWIM authors (Abhinandan Das, Indranil Gupta, Ashish Motivala) e-mailed us afterwards and thanked us for extending their work.
08:05 It was good to engage with the original authors of this work.

Standing on the shoulders of giants, as it were. -

08:40 I had asked myself whether I wanted to stay in academia or go into industry, because most of my academic work was in research.
08:45 I ended up dropping out from my PhD program to join a start-up, but I was torn because I loved that side of it.
09:00 It was important to bring that approach back to HashiCorp, which is why we are fairly academic for a start-up.

How do you balance the industry need to make money with the research goals of academia? -

09:30 I'm going to talk about this during my talk - part of it is that the charter of the group is industrial research.
09:40 That's an important distinction from, say, Microsoft Research, where you could be a researcher improving the theoretical bound of some algorithm from N^40 to N^38.
09:50 That's interesting from an academic perspective, but there's no practical impact on an algorithm that is N^38 complexity.
09:55 That's good work for academic research, but not something HashiCorp can invest in.
10:05 The industrial focus of our group means that it is something that's applicable to our products or is something within 18-24 months of what the market will be able to use.
10:15 The work on Lifeguard makes the system much more stable than it was without it, so it had practical benefits for large clusters today.
10:30 Now we're working on automated configuration around Vault; that's a problem for today.
10:40 If it's something that's going to end up in a product, then it's easier to justify that investment.

You need to place your bets accordingly. -

11:00 If our engineering group is the first horizon, the research group is the second horizon.
11:05 If we get big enough one day to invest in the third horizon we might aim for moonshots.

Can you give me a quick overview of where HashiCorp is at the moment? -

11:40 If I looked across the portfolio, we haven't released a new tool since 2015.
11:45 The first few years of the company, we were releasing new tools every 6-8 months.
11:50 Our view was (at the time) there were huge gaps in the market.
11:55 You need a tool to do provisioning (like Terraform); you need a product to do security (like Vault); you need a thing to do networking (like Consul); you need a thing to do scheduling (like Nomad) - we were building the missing pieces.
12:10 Since then we've been on a constant refinement process.
12:15 How do you go from a tool needing an expert level user to one where it's less intimidating?
12:25 That's a big investment, in terms of sanding down the edges, getting the getting started guides up to quality, making the UIs really nice, polishing the CLI.
12:30 Some of that's just documentation at HashiCorp.
12:40 The other thing is that infrastructure lives or dies with its integrations - no infrastructure that is successful is hermetically sealed.
12:50 We saw a bit of that mismanagement with Docker Swarm, where the product did not integrate with the ecosystem around it.
12:55 We view that you have to integrate with everything, whether it's Kubernetes, or the serverless offerings that cloud providers have.
13:10 The third one is the maturing of the tools: making them stable under scale, making them "enterprise ready", all of that work of taking a tool from a 0.1 release to a battle-hardened production system.
13:30 Those have been our main investments and if you look at where we're going, it's those same key trends.
13:40 We aren't going for a radical departure, as our systems are used in production.
13:45 We have to be a little more incremental around those big themes.

What about the tooling and training? -

14:20 What we've seen is that the market has said that the tooling matters.
14:30 If you want to bring Terraform into an organisation, you need more than one person to be able to know how to use it; so how do we train 50 people to use it?
14:40 That's where we made a big investment: learn.hashicorp.com, our big learning platform with self-guides and tutorials.
14:45 There's a team of 18 people who work on the content behind that.
14:55 The community has been fragmented across many different places: Stack Overflow, Gitter, IRC, Slack, Google Groups - all over the place.
15:10 You end up answering the same question in ten different forums ten times.
15:15 We wanted to bring all of that together into a community portal, so that when you search for a question you find the answer - it's like a Stack Overflow for HashiCorp.
15:30 That's important when you talk about how to bring people along the journey with you - there's one place to go to get answers.
15:45 Later this year we're going to be looking at certification - what we hear from companies is that they want people who know how to use Terraform rather than just putting it on their resume.
16:05 Part of the value of certification is allowing experts to signal their value to employers, and giving employers a baseline of competence.

Will HashiCorp be looking to provide guidance on what technologies people should be using when moving to the cloud? -

17:25 It's a tough question: you never see technologies standardise; companies have everything from mainframes to bare metal, VMs, containers ...
17:50 The reality is that very rarely do we see companies standardise on specific technologies top-down; it usually ends up being a hodge-podge of having it all.
18:00 It goes back to the notion of integration: knowing that you have it all, rather than having one provisioning tool for containers and another for serverless, you just have Terraform and it can handle all of them.
18:20 That is the reality for most organisations.

What are your thoughts around standardisation of key layers of the stack? -

18:55 I have torn views; obviously we are a big participant in the Service Mesh Interface (SMI).
19:00 SMI particularly is interesting at the network layer.
19:15 The network is the common denominator between mainframes and serverless with TCP.
19:20 You can span 40 years of technology because the network makes that possible.
19:25 At certain levels, you need standardisation because you need to have interoperability.
19:35 Where you don't need interoperability, you don't need a standard: the vSphere API used to boot a VM has nothing to do with OpenStack.
19:50 I don't need interop between vSphere and OpenStack - but for the network you need interop.
19:55 So you need a standard when interop is important; the other place is where the switching cost would be too high.
20:05 A good example of that would be tracing; you want to have one type of SDK to export the telemetry, but you could have Datadog, AppDynamics, or roll your own to collect it.
20:25 You wouldn't want to use a proprietary format in all of your applications and then have to go and retool all of them.
20:30 Those become some of the key points where an interface makes sense.
20:45 The other thing is the markets are so dynamic and rapidly evolving, that it's very hard to build these standards.
20:55 If you understand 95% of the space, then you can have an interface for that and evolve the other 5%, but if you only understand 5% of the space and you are still evolving the other 95% then it's a much more difficult thing to do.
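
The tracing example above, one SDK in the application with interchangeable collectors behind it, can be sketched in Python as follows. This is a minimal illustration of the interface idea only; `Span`, `SpanExporter`, `InMemoryExporter`, `StdoutExporter`, and `Tracer` are hypothetical names and do not correspond to any vendor's or standard's actual API.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Span:
    """A single timed operation reported by the application."""
    name: str
    duration_ms: float


class SpanExporter(Protocol):
    """The stable interface that application code depends on."""

    def export(self, span: Span) -> None: ...


class InMemoryExporter:
    """Stand-in for a 'roll your own' collector: keeps spans in a list."""

    def __init__(self) -> None:
        self.spans: List[Span] = []

    def export(self, span: Span) -> None:
        self.spans.append(span)


class StdoutExporter:
    """Stand-in for a vendor backend: prints each span."""

    def export(self, span: Span) -> None:
        print(f"span={span.name} duration_ms={span.duration_ms}")


class Tracer:
    """Application-facing SDK; it only knows about SpanExporter."""

    def __init__(self, exporter: SpanExporter) -> None:
        self._exporter = exporter

    def record(self, name: str, duration_ms: float) -> None:
        self._exporter.export(Span(name, duration_ms))


if __name__ == "__main__":
    # Swapping the backend is a one-line change at start-up;
    # the call sites using tracer.record() stay untouched.
    tracer = Tracer(StdoutExporter())
    tracer.record("checkout", 12.5)
```

Because the application only calls `Tracer.record`, changing the collector does not require retooling every service, which is the switching-cost argument made above.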

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.