
Scaling Infrastructure Engineering at Slack


Summary

Julia Grace was asked to build Slack’s first infrastructure engineering organization in August 2016. The company was two years old and they were approaching the scalability limits of the original infrastructure. Things were starting to break in strange and unpredictable ways. She discusses the architectural and organizational challenges, mistakes and war stories of 2.5 years that followed.

Bio

Julia Grace is currently a senior director of product engineering at Slack focused on building network effects into Slack through shared channels. Prior to joining product engineering she built the infrastructure team at Slack, growing it from 10 to 100 engineers in three offices in two years.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Grace: We're going to go back in time to 2015. How many of you remember 2015? It was a good year generally speaking. I started using Slack in 2014 before I ever thought about working there. Long story short, I was at my own company, I sold it. I was using Slack every day and I was building apps on the developer APIs. In 2015 after I sold my company, I'm trying to decide what I want to do, I take out my iPhone. I look at all the apps on my home screen and there's Slack. I joined in the fall of 2015.

Now at that point in the company, there was no infrastructure organization. We were only a couple years out of transitioning from Glitch, the game, to Slack, the communication and messaging tool. In the beginning, and I see this a lot at startups, there is product engineering and product engineering exists to build features. I joined developer platform, that was the first organization that was split out of product engineering that didn't build features. We built the rich APIs that developers use. How many of you have developed on top of Slack before? There were fewer than 100 engineers at the company. The whole company was under 300 people, it really was a startup at that point in time.

This is a daily active user growth graph. You'll see I've blocked out everything past 2015 because remember we're still in 2015 and you could see that the daily active user growth was starting to grow. When I was using it at my small company, we had substantially less usage and now things are starting to take off.

Phase 1: 2016

Now, it's 2016, the CTO, Cal Henderson, and our VP of engineering, Michael Lopp, they come to me and they say, ''Hey, Julia, we have a challenge for you. The original infrastructure that was built for Slack that was built by the founders, it was not scaling to keep up with what we needed.'' Remember, we didn't have an infrastructure organization. Our infrastructure consisted of one service written in Java and then a heavily sharded MySQL. I remember after the CTO comes to me, I didn't immediately say yes and I was thinking about, do I want to build an infrastructure org? It was a really scary challenge at the time.

I was reading Satya Nadella's book at that point in time and I remember this passage from it and it reminded me of what I was about to embark on over the next few years, the journey that I'm going to take you all through. See, Ballmer says this to Satya Nadella when Satya Nadella is about to take over the cloud business at Microsoft, which at the time was really playing catch-up to AWS. He says, ''Look, this is the most important challenge I have. I don't think this is even a smart decision for you, but I want you to do it. Think wisely and choose. By the way, if you fail, there is no parachute. It's not like I'm going to rescue you and put you in your old job.''

I'm faced with this really hard decision. We had to scale the infrastructure, it was an existential challenge for the business and we couldn't fail. That was what it was like in 2016. Now, as you can probably imagine, after a lot of trepidation and a lot of late nights, I decided to say yes. What happened? Now, we're starting to see more growth. The growth is continuing and there are a lot of axes of growth at this point in time in the company. There's daily active user growth, there are more and more companies signing up to use the product. There's headcount growth, we're growing engineering astronomically. A lot is going on during this point in time in 2016.

It's important to know as we talk about infrastructure, that Slack was originally designed for small and medium teams and you make very different architectural decisions when you're building for teams of 10 users, 50 users as compared to hundreds and hundreds of thousands of users.

As I mentioned before, we had a handful of engineers doing Infra work. Many of them were in product engineering. At the time, they were always frustrated because when you're in product engineering and you're doing infrastructure-shaped work, a lot of the time the features come before building and investing in the infrastructure. They were continually frustrated, they wanted to build Infra, but we didn't have this org. I assembled all those engineers together. By this point in time, we had about 150 engineers total at the company. We didn't start Infra from day one, we started several years later.

The best part of this was things were starting to break. It was never what we thought would break because the things that we planned for, those really hairy single points of failure, we often knew how to solve issues related to those pieces. It was all the weird stuff on the fringes and the edges. It's this beautiful thing that happens when people start using your product in ways that you never imagined. They start building on your APIs, which then puts a fascinating load on your infrastructure in ways that you never thought of.

User Presence

I'm going to give you a very specific example. This one is fabulous because it actually has a user-facing component. A lot of infrastructure is behind the scenes and when it works, you never notice, but this one you'll notice.

How many of you have built user-presence systems before? Online, offline, away. A couple of hands. Before I worked at Slack, I didn't understand how complex it was to build presence. What I mean by that is, let's go through an example. You've got your laptop open, you're using Slack, your green dot is illuminated. You're online, you close your laptop. Are you away? Maybe you immediately open it. Maybe the internet happens to go down and come up. Maybe you close your laptop, but then you go to mobile so you're away, but then you're back online. How long should we wait until we say that you're away and how long is away really? How long are you away before you become offline? This is one of these beautiful systems where it's not binary, you're online or offline. There's this gradation of away that would keep us up at night.

People expect presence to just work and if it doesn't work, that's a really jarring experience. I remember in 2016, there were apps and bots that people built on top of Slack. If somebody would get a message from a bot and the bot was offline, it was really jarring to them. Or if they would message a bot and the bot was away - how could the bot even be away? What does that even mean? This is why to this day, all bots and apps - if you use, we have a fantastic Google Drive integration - if you use any of those bots or apps, they're always online to ensure you can sleep at night. At the time we formed infrastructure, presence updates were broadcast to every single person in the organization.
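The gradation between online, away and offline described above can be sketched as a small function that derives a status from the most recent activity on any of a user's devices. This is a minimal sketch, not Slack's implementation: the function name and the two thresholds (`AWAY_AFTER`, `OFFLINE_AFTER`) are assumptions chosen for illustration, since the talk does not give the real values. The one rule taken from the talk is that bots and apps are always shown as online.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Hypothetical thresholds; the talk does not state the values Slack uses.
AWAY_AFTER = timedelta(minutes=10)
OFFLINE_AFTER = timedelta(minutes=30)


def presence(last_seen_per_device: Iterable[datetime], now: datetime,
             is_bot: bool = False) -> str:
    """Derive a status from the most recent activity on any device."""
    if is_bot:
        # Per the talk, bots and apps are always shown as online.
        return "online"
    latest: Optional[datetime] = max(last_seen_per_device, default=None)
    if latest is None:
        # No device has ever reported activity.
        return "offline"
    idle = now - latest
    if idle < AWAY_AFTER:
        return "online"
    if idle < OFFLINE_AFTER:
        return "away"
    return "offline"
```

Taking the latest timestamp across devices covers the laptop-closed-then-mobile case: closing the laptop does not make the user away as long as the phone has reported activity recently.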

Let's take a slight deep dive into how Slack works. If you're on your desktop computer, Linux, Windows, whatever flavor of operating system you like, you open up the computer and we establish a web socket connection. The beautiful thing about a web socket connection is we're sending a lot of bi-directional data about what's going on, new channels, new mentions. Oh, and guess what? Sally is offline. Bob is away. These updates were 80% of the traffic that we were sending.

If you'll remember, in the early days for small and medium companies, you probably actually cared if you had access to see if everyone was online, offline or away because you were talking to every single person in Slack. When I was using it at my startup, I would talk to every person, I actually wanted to know. When you get into the hundreds, the single thousands, tens of thousands, hundreds of thousands of users, let me tell you, you really don't care if someone you've never talked to has gone from online to away, but you had access to this information.

Let’s say we form our group and suddenly we had this almost doomsday clock ticking: we had projected out the business to the point when the presence system was going to crash, when we were going to be crushed under this load. Really rapidly, we had to change the semantics of how presence worked. I'm not going to go into too much detail, we can talk about it after, but we had to transition from a broadcast to a publish-subscribe model. That's the history of how we've really scaled Slack, if I were to put it into one sentence: going from broadcast and N-squared to understanding who are the users and who are the channels that you talk to or interact with the most and ensuring that you have ready, fast access to the data about those people in those channels.
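The difference between the two models can be sketched by comparing who receives a single presence change. This is an illustrative toy, not Slack's presence service: the class and method names are assumptions, and how a client decides whom to subscribe to (for example, the people visible in the channels it has open) is left out.

```python
from collections import defaultdict
from typing import Dict, List, Set


class BroadcastPresence:
    """Every status change is sent to every other member of the organization."""

    def __init__(self, members: Set[str]) -> None:
        self.members = set(members)

    def recipients(self, changed_user: str) -> List[str]:
        return sorted(self.members - {changed_user})


class PubSubPresence:
    """A status change is sent only to users who subscribed to that user."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Set[str]] = defaultdict(set)

    def subscribe(self, watcher: str, target: str) -> None:
        self._subscribers[target].add(watcher)

    def unsubscribe(self, watcher: str, target: str) -> None:
        self._subscribers[target].discard(watcher)

    def recipients(self, changed_user: str) -> List[str]:
        # .get avoids creating empty entries for users nobody watches.
        return sorted(self._subscribers.get(changed_user, set()))
```

In an organization of 10,000 members, one status change under broadcast produces 9,999 messages, so if every member changes status once that is roughly 100 million messages. Under publish-subscribe the cost of a change is proportional to the number of people actually watching that user.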

Organizational Challenges

That's a very brief flavor of some of the technical challenges. I'll talk a little bit more about them in a second, but there were fascinating organizational challenges as well. Slack is a product and design-led company, we are an enterprise company. Why does that matter? With infrastructure, I wanted to build an engineering-led organization in a product-led company, and that is very difficult and harrowing.

Why does that matter? In the beginning when the infrastructure was showing its age, we wanted to invest. What happens when we make improvements and things start working? How do we ensure we can still get things like headcount, like budget, so that we can continue to invest in infrastructure when suddenly it just works again? Let me tell you, for every phase of growth, something strange will break and it won't just work again.

I grew up in New Mexico and so I like to use pictures from New Mexico. We're going to walk through a visualization of what it felt like when I was building that org from an organizational standpoint. I inherited Infrastructure and you can imagine, I needed to build, we wanted to send data, we needed a walkway from where you are right now to that picnic area over there. At first, people were, “Of course we need to invest, we need to make sure that we can move data, that we can send people across that bridge. Of course, you'll get headcount. Of course, you'll get budget, absolutely.”

We start building and building, and then we build this really beautiful bridge. People are walking across it and it's awesome and they're, “Oh, it's just there, it works. Great job.” Now we're going to divert our resources over here. Then the next week someone comes to me and they say, ''I'm going to drive my car across this bridge. It's going to work, right?'' I'm thinking, absolutely not, but they're, ''Well, ok, good to know, but it's going to happen anyway.''

I'm thinking to myself, we need to react to that. Unfortunately, these are the early days of Infrastructure, we're reacting, we're not being proactive quite yet. I'm, "Ok, we better put some supports on the bridge, we're going to need to divert some of our prioritization and our effort over to supporting this thing." Then a month later someone's, ''Yes, I've got this semi and we're going to drive it across the bridge.'' Then, of course, I'm, ''Ok, the bridge is going to collapse. We're going to have catastrophic failure if someone drives a semi across the bridge. You can drive your car as long as it's one of those European small cars, but you cannot drive your semi across. Let's talk about how you could drive your semi across in 2017.'' When the bridge is working perfectly and people are walking, you don't even think about it, only when it's collapsing. It seems like that is the mantra of infrastructure.

Evangelism

One of the things I wanted to do to combat this notion of at what moment is someone going to tell me they're going to drive a semi over my walkway is this notion of evangelism. In engineering, if I were to get up here and say the word sales, that's a really dirty word in engineering - I only say evangelism. I started evangelism within the organization from day one about what is infrastructure, why does it matter? We're building these bridges, who cares? Well, you should care and this is why you should care. We were a new organization in an existing company growing at a rapid pace. One of the first things that the team really wanted to do was invent all-new process.

Think about it from your perspective. You're an engineer, you're joining an organization that doesn't have product managers, that doesn't have designers, doesn't even have program managers. You come over and you're, "Yes, I can do whatever I want and it's going to be awesome and we don't even have to use Jira."

One of the earliest conversations I had with a very early engineer at the company was like, "We don't even have to have a roadmap. Let's just build whatever we want." As part of that discussion, this engineer was very early at the company and he was, ''We didn't have a roadmap three years ago.'' I was, ''Well, there were 10 of you, now there's 200 of us.'' I didn't want to go completely rogue and not have any planning and not use any process, but we needed to have some loose process. The reason for that is the company had already centralized on a certain toolset. I decided we have to use some of these tools. We can't just throw out the baby with the bathwater.

Finally, if any of you are in management or in senior technical leadership positions, you get exposed to the inner workings of headcount. I've talked a little bit about headcount before. I needed to have an executive who would sit at the table, who would often go to bat for us and deeply understand what we were doing, even though I was going on my sales evangelism tour, so that when the rubber hit the road and the person was going to drive the semi, I could actually, if I needed to, address some of it with headcount. I identified that executive and I made sure that executive was very bought into what we were doing.

Phase 2: 2017

By the end of 2016, we had formed the organization, we were about 20 people, the presence ticking time bomb was solved. We started to learn how to be a team, and how to get stuff done, how to ship software as a team, a new team in a larger organization. Things were going ok, still a little bit of hair on fire.

Now it’s 2017, let's talk a little bit about technology and some of the technology challenges that we were facing. We ran and continue to run a PHP monolith. We were in the process of transitioning to Hack, which came out of Facebook. We hired some folks out of Facebook to help us with that transition. We were also in the midst of transitioning to HHVM, the virtual machine that runs Hack. Our culture is very much rooted in the idea that you need to deeply understand what you're building and how it scales. We were still averse to external libraries because when you're operating at really intense scale, this large web-scale, you need to know how to debug it when it breaks because it's going to break.

In the infrastructure organization, I inherited the one team that was building a service, so we don't have microservices. We don't really have a service-oriented architecture. We are interested in going in that direction in 2017, but it was really early. I inherited the team that was building the presence service, which was written in Java. We were experimenting with Go at the time, we were running into some interesting performance limitations outside of the United States and we really wanted to build our own caching tier and we decided Go would be the right language for that. That was also going to be the second service that we had ever built. These were very bespoke services, we didn't have patterns, we didn't really know how we were going to route traffic. We didn't have common rate-limiting across these services. It's very early days.

We are also in the midst of changing how we sharded data. We use a very heavily sharded MySQL. We had sharded by team and there were really beautiful benefits. A team might be a workspace: it's your company, and your company is on a shard. In the beginning that was great, you had failure isolation, it was really easy to know, “Oh that organization maps to that shard.” As we grew and grew, we had really bad hotspots: certain shards would get particularly hot Monday mornings when that whole organization would come online. Then we had this really long tail where almost all of our shards had very little utilization, but then a few shards were constantly on fire.

We decided that we were going to transition, in a multiyear effort, to Vitess. Vitess is a technology that came out of YouTube. It's how they scaled MySQL. Instead of sharding by team, we were going to change how we sharded all the data. It is very easy to say that, it is a very complex and nuanced process to make that transition. We can talk more about it another time, I can also point you to a talk that a colleague of mine gave at QConSF all about the transition because we could talk for days about this.
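The hotspot problem with sharding by team can be illustrated with a toy model that counts how many messages land on each shard under two different shard keys. This is a sketch under stated assumptions, not how Slack or Vitess route queries: the talk does not say which key replaced team, so the finer-grained `team:channel` key, the shard count, and the function names here are all hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable, Tuple

NUM_SHARDS = 4  # hypothetical; a real deployment would have far more


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_by_shard(messages: Iterable[Tuple[str, str]], by: str) -> Counter:
    """Count messages per shard; each message is (team_id, channel_id)."""
    load: Counter = Counter()
    for team_id, channel_id in messages:
        # by="team" reproduces the original scheme; anything else uses
        # the hypothetical finer-grained key.
        key = team_id if by == "team" else f"{team_id}:{channel_id}"
        load[shard_for(key)] += 1
    return load
```

With a single large team writing to many channels, `load_by_shard(messages, "team")` puts every message on one shard, which is the Monday-morning hotspot, while the finer-grained key spreads the same traffic across shards. The trade-off is the one named in the talk: the easy "that organization maps to that shard" property and its failure isolation are lost.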

Our Technology Stack

I mentioned we've got services, we're starting to think about our database, sharding, our database topology. This is a very simplified diagram of our technology stack, we're missing so many other pieces. We've got this monolith, we've got these services and we're, “Oh, infrastructure, we'll just cut the line there.” We'll own everything below the line and then the monolith will be owned by product engineering, and there's hundreds of engineers in product engineering and they're all committing to the monolith. Here's where it gets tricky. Who owns the bottom of the monolith? Have any of you run into this sort of a challenge before? I see a few heads nodding.

This is why this is hard. You've got these API layers and some of our services have SLAs, some of them we're still figuring out what the SLAs are, but who are the Infra-minded folks developing in the bottom of the monolith? We're changing our sharding strategy. There's a lot of routing logic in the bottom of the monolith. Who owns that routing logic? Where should that routing logic live? The boundaries were really messy and it took us quite a long time to figure out who should be developing in the bottom of the monolith and who should be on the other side of those services. We're not in a microservices architecture and this is incredibly complex, especially when you only have maybe 20 people developing underneath that line and you've got hundreds and hundreds on the top writing queries, caching data, using these services.

Communication Risk

We also encountered a lot of what I like to call communication risk. In reflecting back on my career, I think this is one of the most underestimated and incredibly complex risks that most engineers and engineering technical leaders, managers don't take into account, and it bites us every single time. Usually, we want to work on hard, interesting technical problems of scale. Most companies we advertise, we've got these hard problems, they're really fun, you're going to be on the bleeding edge, you're going to be developing exciting technologies, but when you're actually developing these technologies, and you're writing out these documents about how they work and more importantly communicating with other parts of the organization about how to build on some of these foundational, exciting infrastructure technologies, you have a lot of communication risk. We would name our projects these funny names and nobody would know what they were or how they worked. Actually, this got worse over time where people would want to treat, in some ways, what we were doing as a black box.

From an engineering standpoint, you're, “I don't care about how your service works. Show me your APIs, give me a notion of your SLAs, I don't care.” In reality, especially when you've got a monolith and you're very early in developing services, it doesn't work that way. What would often happen is we would want to change how we do caching or how we do database routing. We had to communicate that throughout the organization. We didn't want to just show up one day with this big bang moment of, here's how it works, use it.

We had to have this whole communication discussion along the way. How were we selling and evangelizing what we're doing? Why does it matter? Who cares? It’s about being able to really communicate that clearly. I think that there's a lot more communication risk in infrastructure than I've seen in other parts of the organization. I've seen us try to tackle really complex bugs where someone says, "Hey, what's up with that bug?" and then you talk to them for 30 minutes, and that's when you know you have some sort of communication risk. We didn't have a quick one-sentence answer for what it was that we were doing.

Hiring

Finally, the last big challenge of 2017 was hiring. How many of you feel like you're under pressure to hire all the time? Only a third of you, so for the other two-thirds of you, that's awesome. In 2016, the group was really small and there were about 10 engineers. The first thing that all the engineers came to tell me was, ''Oh my gosh, we're totally underwater. You need to hire a bunch of people, like yesterday.'' The CTO was, ''You need to scale this group, here's your headcount, you better use it, you need to scale.'' Then, who do we need to hire? What are the skill sets? The currency of managers is often how large their organization is, and that can be reflective. That can be a good measure and it can be a very dangerous measure.

People are, "Well, how are you growing? What's your plan?" I had to get all the engineers together in a room and we had to have a conversation about, what are the skills we value in infrastructure engineers? How are we going to assess those skills? At Slack, at the time, we didn't know how to hire infrastructure engineers. We'd hired Android, we'd hired iOS, we knew folks who knew how to do great front end work, your general backend work. After a lot of very difficult conversations, we decided on a coding exercise. I'm happy to talk more about this offline, about why we decided and the pros and cons of coding exercises.

We had an exercise and I've listed some of the actual criteria that we aligned on internally of how do we want to hire folks and how do we want to measure. It was fun, as we grew and as we hired, there were always these moments where we'd have drift where someone would be thinking, "You know what, we really need more database expertise. I think we're going to have a SPOF over here on how we scale this part of the database.” Subconsciously, that person would start interviewing people more for database experience and then I'd have to come back and I'd say, ''You're not aligned with the rest of the people interviewing that person. You're judging them on criteria that other people don't understand.'' We really had to be laser-focused on how we hire. There's a lot there around how do you eliminate unconscious bias? That's a whole other talk.

This brings us to the end of 2017, it was another hectic year, I was doing a lot at that time. As I mentioned before, when we first started infrastructure, we didn't have product managers and the engineers loved this and they were, “We're going to have the time of our lives. No one's going to tell me what to do.” That's not actually how it works. What happens when you have a product-led company and you don't have product managers is that there's a whole lot of valuable and important product process, like product thinking, that goes on in the organization.

What would happen was the engineers would say like, "We should not hire any PMs in Infra, no PMs ever.” Then a group would come to me and they'd say, ''Hey, we want to run the semi across the bridge. What do you think?'' I'd say, ''Thank you for coming to us.'' I'd send an engineer to go talk to the product manager and both people were really displeased with that arrangement. The good thing is the engineers in infrastructure started to gain a lot of awareness of what was going on in product engineering, but product engineering is a very large organization.

At one point, I did decide, I hired a head of infrastructure product management, and I remember the engineer saying, "What are they going to do all day?" The question is really, "Are they going to boss me around?" I'm, "No, no one's going to boss you around." I was, "Remember that meeting you went to where five people all talked about driving a semi across the bridge?" The PM was going to go to that meeting, they're going to figure out what is the prioritization, how do we work? What are the requirements? They're going to engage in a discussion with us. They're going to go to that meeting. Then the engineer is, ''Oh, we need to hire that person immediately.''

When I've talked to folks about infrastructure and product management, this leads to a lot of other nuanced and interesting questions that I'm not going to dig too far into today because I think it depends a lot on your organization and what you want to accomplish and the type of company you are and who leads your company. In enterprise we have a very large sales organization and I had to actually start interfacing a lot with our sales organization. When you think about infrastructure, infrastructure feels like it's on another planet from enterprise sales. In actuality, if you're going to have an enterprise customer who runs their business on your product, they often want to have a high degree of understanding and confidence in the infrastructure and how it works. They want to know that you can keep the product up and that it can scale with their usage. Suddenly, these worlds started coming together, our product manager would work closely with large customers, with sales and eventually I hired other functions as well.

Phase 3: 2018

That brings us to 2018. 2018 was really this moment when I took this step back where we had grown over the past couple of years from about 10 engineers to now nearly a hundred and we were out of this zero to one phase. There really wasn't much in 2016 and now we were going from one to infinity. We were a large org, we were in three offices and we weren't just this motley crew. We were an actual organization and we had to have much more communication and process, but I mean process in a good way. Process is codification of learning where we would write stuff down of how things worked and how things didn't.

It was a fascinating transition for myself in a leadership capacity and for the principal engineers and the tech leads that I worked really closely with, going from this chaos environment where we're barely building the bridge to now having a very different job at the company. Our services had matured really rapidly. When I showed that picture of the monolith and who owns the bottom of the monolith and the services below, we started to really think strategically about how do we build services, how do we enable people to build services easily and quickly. In 2016 if you wanted to build a service, it was like, “Good luck. Like, let us know if you need anything.” Now, we were, “Ok, here's the template. Here's what you need to think about.” If any of you, hopefully most of you, search in Slack, we built search, ranking and relevancy as a service. That was this brand new third service that we were building on.

We transitioned from hiring Infra generalists to now really having to hire Infra specialists. Instead of folks who just generally loved building infrastructure, we had folks who specialized, for example, in database storage. We have a very large asynchronous job queuing system. If you put a URL into Slack, hopefully with almost no latency, you see a preview of that URL. That goes to a system that was running tens of thousands of jobs and now runs billions of jobs per day that are both short running and long running. There are folks who've thought a lot about these asynchronous systems and have built them at other companies, so we were really focusing on hiring folks who love that and for whom that's their passion.

I also had to hire manager specialists and that was really interesting. At one point, I was running what you would consider core infrastructure, our caching, our database tier, our search infrastructure, machine learning and data infrastructure. I always love to call data the gateway drug to G&A, where G&A is often the finance organization, because if you're the one running a lot of the data and the data pipelines, those are often consumed by a lot of folks, analysts who are understanding the nature of the business. I ended up hiring specialized leaders for each of those functions who are deep data specialists, who are deep search and machine learning specialists because those orgs really needed to grow substantially and they needed really strong leadership that could have that vision. Instead of, "It's broken, you need to fix it," it was, “How are we going to scale over the next 10 years?”

We built product management. That was a fascinating journey where for every product manager that we added, there was actually significant benefit. One of the biggest mistakes I made was not hiring product managers sooner. Here is something I loved, and many of you hopefully can relate to this. One of the PMs came to me one day and she was saying, “I didn't realize there'd be a lot of aspects of this job.” One of them is taking a Slack conversation and turning it into a product spec. We run the whole business on Slack and so a lot of these details get hashed out in this long conversation. The beautiful thing of that is you can go reference that conversation. The hard thing is, imagine saying to someone, what are you going to build? You're, “Well, start here in the channel and read for two days, two days of chat conversation and then you'll know what we're going to build.” That's crazy. The PM translates that into what we're actually going to build, working alongside the engineers.

Then finally, a big part of growing this story was identifying where are the areas where we needed to invest and we needed to invest really rapidly. One of the beautiful things about doing acquisitions is you can often have a whole team join you that can have this running start. I acquired a company in the service mesh area to help us build and rethink the future of Kubernetes and service mesh, we're moving towards Kubernetes now at Slack. I think one of the hardest things though is in the beginning when you have these small teams, you're all working together, you don't have to write too much down, although we did try to write stuff down. As your organization grows, it's hard to have coherency across those organizations.

I had this wonderful privilege of leading frontend infrastructure. These are folks writing JavaScript in the Slack client that's handling the other end of the APIs. They're, “We don't have anything in common with some of the other folks in the org.” That made all-hands really fascinating and challenging because half of the team was always really engaged and the other half was asleep. It became this really difficult challenge of how do you have that coherency and maybe it's time to break those organizations up. That's what we eventually did.

Looking back, I should have done a lot more reorgs and I should've broken up a lot more parts of the organization so that they could have more specialization, but instead, it was working so we kept it all together.

Summing Up

Looking back - and since I work at Slack, I'll use emoji - 2016, when we started the group, it was chaos in a beautiful way. 2017, you're sweating, but you're smiling. We're figuring it out, it's coming together. 2018 was when we were able to transition from reacting to being proactive. We actually had built not only the walkways, but we built a lot of the highways so we knew we had projected out how many semis were going to drive across those highways, so we felt much more ready.

I was looking back and I was pulling some of the numbers of how things have grown and changed and evolved over time. It's really fascinating and I think exciting to see the types of growth across all axes. I had talked about our asynchronous job queuing infrastructure; there's also daily active users and watching that grow, simultaneously connected users, headcount in different offices, generalists to specialists. It's such an amazing team and it's been a wonderful ride and a huge privilege to be able to lead that team.

Questions and Answers

Participant 1: You mentioned that you wish you had done more reorgs. How exactly has that worked for all of your engineers? Because I know in engineering, at least in my organization, we dread reorgs.

Grace: Something that I learned and I often tell folks in the company is, if you're growing on all these axes, the only thing constant is change. How do you hold the tension between wanting stability, wanting to understand who's on your team and who your manager is, forming that relationship with your manager and with your colleagues, and the fact that everything is growing much faster than you can ever imagine?

Finding that place of balance I think is really tough. I had so much empathy and I've been there myself. I've had a lot of managers over the course of these four years and I was, “We're going to really try to provide the most stable environment that we can and not change things up too much.” The trick is if you're in that mindset, what often happens is there's this moment when you cross this threshold where you're, “Oh no, I should have changed it.”

I often think about this in terms of like organizational growth size where you're, “Oh my gosh, everything's broken. Nothing works.” You throw away all your process, you throw away your tools, figure out some new process, figure out some new tools, figure out some new structure and you're, "Yes, I've solved it. It's going to be awesome." Everything's great and you're, "I don't have to change it. I can solve some of the problem." Then you grow and everything breaks and everything's a disaster and all the people who've just joined the organization are, "This place is so chaotic," and you're, "But it was just working fine."

I think it's not necessarily that you want to introduce volatility, but you really have to be hyper-aware of when you've passed those points. That might mean reorging or changing the structure. Honestly, it's super hard, and I talk about this too with my technical principal, with the leaders I work with on the technical side: how do you have fluidity and that notion of allowing change, and then saying, "You're going to work with someone on another team and that's ok"? Because usually, people are, "I'm going to work on my team and these are the boundaries of our team," but there's so much fluidity. How do you have both fluidity and stability and hold those things together? I don't think there's really any great answer other than prepare for change. I tried to work with the engineers who report to me. I'm, "What are the things that matter at the time where we really need to root in stability, and how can we be flexible on other axes?"

Participant 2: This is specifically with regards to hiring, and you mentioned hiring specialists via take-home tests. While I'm a big fan of that, one thing that I think is a struggle is getting people to actually be active and participating, in the sense that when I send take-home tests to people, you also have other companies making offers on the fly. Getting people engaged is something that in the past I struggled with. What are your thoughts on how to incentivize, how to keep people engaged?

Grace: The question was all about hiring, take-home coding challenges. This is how I think about hiring. When you interview someone, most companies use the same interview techniques or methodology that you've experienced as an engineer. You go to show up at a company and you're, “They asked me these questions, so I should ask those questions of every single candidate.”

What happens when you just blatantly copy, and I don't necessarily say that in a pejorative sense, but if you're copying a process from another company, first of all, you're often copying a process that generally only works for some people. Interview processes were generally designed years ago by big companies who are trying to hire a certain type of engineer, and your interview process will probably need to hire a different type of engineer. How do you have a process where somebody feels comfortable and can do really great work and you can get high signal as to their ability to do the job?

Every single person probably has a different way that they can really shine in an interview process. I'll give you some examples. If I'm, “All right, QCon's changing course, everyone's going to do a whiteboard coding interview in that room, so line up.” How many of you are, "Hell yes."? It looks like almost no one. That's part of the challenge now. Then if we're not excited about all lining up to do a whiteboard coding test, how do we actually assess one's ability? There could be pair programming, there could be some sort of take-home assignment, there could be other methodologies as well. I've seen people modify an open-source framework.

What you want to do is figure out for the candidate, what is the way that you feel most comfortable that we as a company can assess accurately and without bias. If you do too many one-offs for senior candidates because they're getting all these offers, what happens then is you're not able to compare consistently across your set. What I would suggest, and one of the things that we ended up doing, was we offered people two options. It's a longer conversation and there's a lot of nuance here, but we were like, “Here's our take-home. If you don't want to do it and you're on a timeline, or if this makes you feel like you don't have the time, I totally get it. Come in and you do a longer onsite.” We offer those two options and we say, whatever makes you feel more comfortable, because we want you to feel comfortable in the process.

It's interesting how many people will actually opt for the take-home when they're presented the options. We did run into this problem of, it's also very different with junior candidates to really senior candidates. You can't have a one-size-fits-all process for people coming out of college as compared to folks who've been in the industry for 20 years. The folks coming out of college are often really good with some of the whiteboard coding stuff because they've been doing this a really long time or they've been studying algorithms and it's fascinating to see really senior candidates who have this rich depth of experience but they may not be as great on the other end of these, like code up a binary search tree.

I'd really encourage you to be really thoughtful. We did it from what do we need, what are we measuring for, how are we assessing and how do we do it blindly? Also, what is the candidate pool, how do they want to be assessed and how do we think about the strengths in the candidate pool that we want to potentially hire for that may not be identified by just a strict coding test.

Participant 3: You mentioned there is a team to take care of the monolith and its features and a second team to handle the breaking into services or microservices. Shouldn't it be a single team that does both of these, because the team who handles the monolith is the team who understands the domain, and this should be the same team who breaks up the monolith? How do you train your engineers for this new technology, new way of architecting?

Grace: I'm not going to skirt the question, but honestly it depends. It depends upon the culture, how you structure your teams depends upon the culture. It depends upon how rapidly you're growing the organization. It depends upon the goals of the organization. Are you thinking more about scalability? Are you thinking more about reliability? They're fundamentally two separate things.

We've gone back and forth on a lot of different ways in the monolith. In the early days of Slack, you would write straight SQL, no ORMs. Now, when you're transitioning your data store to a very fundamentally different type of system, suddenly, you've got all these queries that don't make sense when you're sharding along different queries. Who changes the queries? Should it be infrastructure, should it be engineers in the monolith? There's a lot of pros and cons there. On one hand, everybody in the monolith needs to learn how to write these new queries. It should be those engineers, but they're, “Sure, great, but how are you going to teach us how to do it? What are the templates? How are we going to make this transition?”

We started doing a lot of query modification, but it's honestly rooted in communication and education, and how do you communicate when you're making these really big changes that will ultimately have significant downstream impact in the monolith. There is no canonical answer of, it's got to be this team or it can never be that team, because it depends what you want to accomplish and how large your org is and how you're going to roll out training and communication about how things are going to change and how you're going to do change management. The one thing I never spent enough time thinking about as an engineer, and now I spend all my time thinking about, is how do we do change management and how do we roll out those changes.


Recorded at:

Jul 23, 2019

Julia Grace
