
Scaling Slack - The Good, the Unexpected, and the Road Ahead


Summary

Mike Demmer talks about the major changes that Slack has made to its service architecture to meet the needs of larger and larger enterprise customers. Demmer presents three of these changes: decomposition of the real-time message service, client-side lazy loading via edge caching, and scaling the primary data storage tier with fine-grained horizontal sharding using Vitess.

Bio

Mike Demmer is a member of Slack's Infrastructure Engineering team, where he works on hard problems of scalability and reliability and leads the development of Slack's next generation database architecture. Previously, he was co-founder and CTO of Jut - a startup applying a new dataflow language to observability for developers and operations engineers.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

I'm going to talk a little bit about some of the transitions that Slack's gone through over the last couple of years. Just by way of introduction, this was the nice welcome post that my manager, Julia, posted on my first day at Slack just about two years ago, when I joined the infrastructure and scalability team. Before this, I spent a bunch of time at various startups and other companies, mostly in the networking space, and spent some years at Berkeley doing a PhD that has very little to do with anything I'm going to talk to you about right now. And this is not a talk about a transition from monolith to microservices. There were some really, really great talks earlier about that theme, and I encourage you, if you didn't see them, to catch the videos afterwards. What this talk is really about is what's changed in Slack in the past couple of years.

So to start framing the conversation, we'll go back in time a little bit. I'll lay out the basic architecture of the Slack service in 2016, and motivate a little bit of why we made these changes in light of new requirements and new scaling challenges. Most of the talk will be about what we did about it, and then I'll sum up with some themes that came out of that work and talk a little bit about what we have in front of us.

Slack in 2016

All right, so let's all get into a time machine and go back to 2016. I've just joined this company. I'm really excited to start. And I go through this presentation about how Slack works at the time. I'm not going to do the thing where I ask everyone to raise your hand if you use Slack. But if you don't know Slack, it is a messaging and communication hub for teams. People can communicate in real time with rich messaging features in order to get their work done. There are lots of integrations, and there are plenty of other features in the product. And I just want to point out right now that I'm going to be focused pretty much entirely on the core messaging features of the product. Search, calls, and all the other parts of the system have their own interesting tales, but I'm just not going to talk about them. Really, this is just about the messaging core of the application.

So, it's important to note that Slack is organized around this concept of a workspace. When you log into the service, you log into a particular domain that relates to your company or an interest group, and then all the communication, all the user profiles, all the interactions really live within this kind of bubble that we call a Slack workspace. And this is a really important element of the architecture at the time.

And so back in 2016, we had about 4 million daily active users. People stayed connected to the service for much of the workday. A large number of our users would come in, open their laptop in the morning, and have Slack open for really the duration of the day before shutting down at night. And so our peak connected users, globally, were about 2.5 million at that time. And those sessions were really long, 10 hours a day, where people would stay connected to the service. At the time, the biggest organizations using the product had around 10,000+ users.

And really there was this culture of a very pragmatic and conservative engineering style, and you'll see that reflected in the architecture of how the system worked, which really can be drawn up in five really simple boxes on this diagram here. Client applications made most of their interactions with the big PHP monolith, which was hosted in Amazon's US East data center. In fact, most of the site ran out of US East. It still does, actually. That was backed by a MySQL database tier. There's also an asynchronous job queue system, which has had its own share of scalability and performance challenges, but I'm not going to talk much about that during this talk. Really, that was where we would defer activities like unfurling links or indexing messages for search; those would get pushed out to that tier.

And then in the yellowish boxes on the top was the real-time message stack. So, in addition to having a pretty traditional web app with all the letters of LAMP in there - Linux, Apache, MySQL, and PHP, all present and happy - we had this real-time message bus, custom software built in a Java tier, which is where all of the pub/sub distribution of the messaging product happened. The only thing that didn't run in us-east-1 was this thing we called the message proxy at the time, which was really just responsible for terminating SSL at the edge and proxying that WebSocket connection over the WAN back to US East.

It's a pretty simple model, right? Five boxes to describe all of Slack. And that was reflected a little bit in how the client and server interactions worked. You start up the client, you open it in the morning, and the first thing it does is make this API call, rtm.start, to the PHP backend, which would download the entire workspace model. So, every user, all of their preferences, their profiles and avatars, every channel, all the information about them, who is in every channel - all of that would get dumped to the client, and the client would populate a pretty fat model and view of the metadata of the team. And then as part of that initial payload, we also got a URL, which the client would then use to connect a WebSocket to the nearby message proxy, which would then go back to the message server.

And so, once you're connected to the service, you're using it for 10 hours a day. Pushes would come over that WebSocket. Messages would arrive, a user might change their avatar and you'd see the image show up and immediately update. And so both the data plane and control plane messages were sent over this real-time message service to keep the client up to date. The backend was scaled out horizontally by a pretty straightforward sharding approach: when we create a new workspace in Slack, we assign that workspace to a particular database shard and to a particular message server shard. And that was pretty much it, right? You log in, and the first API action would look up the metadata row for your workspace from this database tier that we called the mains, the key initial bootstrapping tier. That response would go back to the PHP tier, which would then know which of the database shards and which of the message server shards to route this request to. And from that point on, pretty much all interactions for that user - actually, for all of the users in that workspace - would be confined to that one database shard and that one MS shard.
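To make that routing concrete, here is a minimal sketch in Go of that bootstrapping lookup. The names (a "mains" map, db_shard and ms_shard numbers, the host mapping) are hypothetical stand-ins rather than Slack's actual schema; the point is just that one metadata row pins a workspace to a database shard and a message server shard, and everything else follows from that.

```go
package main

import "fmt"

// WorkspaceRoute is the metadata row looked up from the "mains" tier
// on the first API call. (Field names are illustrative, not Slack's.)
type WorkspaceRoute struct {
	WorkspaceID string
	DBShard     int // which MySQL shard pair holds this workspace's data
	MSShard     int // which message server handles its pub/sub
}

// mains stands in for the bootstrapping database tier.
var mains = map[string]WorkspaceRoute{
	"acme-corp": {WorkspaceID: "acme-corp", DBShard: 35, MSShard: 7},
}

// shardHosts stands in for the big mapping the PHP monolith kept
// from shard number to the host currently serving it.
var shardHosts = map[int]string{
	35: "db35.example.internal",
}

func routeRequest(workspaceID string) (dbHost string, msShard int, err error) {
	route, ok := mains[workspaceID]
	if !ok {
		return "", 0, fmt.Errorf("unknown workspace %q", workspaceID)
	}
	host, ok := shardHosts[route.DBShard]
	if !ok {
		return "", 0, fmt.Errorf("no host registered for db shard %d", route.DBShard)
	}
	// From here on, every request for any user in this workspace goes
	// to the same database shard and the same message server shard.
	return host, route.MSShard, nil
}

func main() {
	host, ms, err := routeRequest("acme-corp")
	if err != nil {
		panic(err)
	}
	fmt.Printf("db host=%s message-server shard=%d\n", host, ms)
}
```

The appeal of this scheme, as described in the talk, is that the entire routing decision is a single lookup; the cost is that everything for a big workspace funnels through the same two shards.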

Now, at the time, we managed our servers in this manner that I'm sort of calling a herd of pets. People have heard the cattle vs. pets metaphor; we really had a lot of these individual pets. Each database server and message server was known by a number, and lots of manual operations were needed to keep that service tier up to date. The PHP code, the monolith, would have a big mapping that would say, "Database shard number 35: here is the hostname that it is currently running on." And there were, like I said, manual processes to keep that up to date.

The one exception to that is that we ran our databases in a slightly non-standard way, where we had two MySQL hosts running as active/active, with master/master replication between them. Meaning each side was available for writes, and they would replicate to each other, which is a particular design choice. The real reason we did this - we were not dummies - was as a pretty pragmatic approach to get site availability, with the obvious caveat that we would sacrifice some consistency. So, every workspace, in addition to having one particular numbered pair of MySQL hosts, also had a preferred side that it would try to read from and write to. And so when everything was up and functioning, every team, every workspace, would be pinned to one database host, with the other side really just the backup for replication. If anything went bad, though, the application would automatically fail over to the other side. And yes, that did produce some conflict problems, and there were some challenges along the way. But for the most part, this model basically worked and the site was able to scale accordingly.
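As a sketch of that preferred-side logic - with made-up host names and a deliberately simplified health check, not Slack's actual implementation - the application-level choice looked roughly like this: use the workspace's pinned side of the pair when it's healthy, otherwise fail over to the other side and accept the consistency risk.

```go
package main

import "fmt"

// ShardPair is one numbered pair of active/active MySQL hosts.
// Host names and the health-check mechanism are illustrative only.
type ShardPair struct {
	SideA, SideB string
}

// healthy stands in for whatever liveness signal the application used.
var healthy = map[string]bool{
	"db35a.example.internal": true,
	"db35b.example.internal": true,
}

// pickHost returns the host a workspace should read from and write to,
// preferring its pinned side and failing over to the other one.
func pickHost(pair ShardPair, preferSideA bool) (string, error) {
	preferred, backup := pair.SideA, pair.SideB
	if !preferSideA {
		preferred, backup = pair.SideB, pair.SideA
	}
	if healthy[preferred] {
		return preferred, nil
	}
	if healthy[backup] {
		// Failing over keeps the site up, at the cost of possible
		// replication conflicts once the preferred side comes back.
		return backup, nil
	}
	return "", fmt.Errorf("both sides of the pair are down")
}

func main() {
	pair := ShardPair{SideA: "db35a.example.internal", SideB: "db35b.example.internal"}
	healthy[pair.SideA] = false // simulate the preferred side going bad
	host, _ := pickHost(pair, true)
	fmt.Println("routing writes to", host) // db35b.example.internal
}
```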

And really there are two dimensions to why I think this worked so well for Slack in 2016. The first is that the overall client model really enabled a very rich real-time feel to the product, and I think that was a huge part of why Slack was so successful. It really just felt like everything happened right away. Messages appeared, reactions appeared. It was a key element of making the feel of the messaging product that rich experience, even when it came to all of these metadata updates.

And then for the backend, we didn't have a huge engineering team, and five boxes is a pretty simple model to keep in your head about how things worked. And if a particular customer was having a bad day, you could figure out very quickly which database host you needed to go look at, or which message server host might be on fire. For debugging and troubleshooting, you only had three places to look for what the problem would be. And so that really helped the team focus on building the product and building the user experience that I think was such a key part of why Slack was so successful.

Things Get More Interesting

Cool. But this is not 2016, this is now 2018. And in the years in between, there were some challenges that we ran into, which is why that model no longer worked very well for us. There are really two dimensions to this. One was the increase in scale for the site, the increase in usage, and the other was some pretty fundamental changes in the product model.

So, first, scale. As you see here, we had great growth. That growth has continued to the point that in 2018 we now have more than 8 million daily active users for the product, more than 7 million of which are connected at any one time. That peak number is right off the presses. Anyway. So, that's a lot of people connected to the service. And we're hosting organizations of more than 125,000 users. So, we've doubled our user base, and we've increased the size of the largest organizations that are using Slack by more than 10 times. And as a result of many of these scaling challenges, as I'll note throughout this talk, we've adapted our engineering style. We're a bigger team. We're a little more ambitious in the types of solutions that we're willing to embrace when it comes to complexity in service of some of these big challenges.

Now, the product model I mentioned before, where there was this nice, tidy workspace bucketing for all of the interactions in Slack - that also started to break down. And this happened in really two phases. The first was the introduction of the enterprise product of Slack. I'm not sure how many people are aware of this, but for large organizations, we offer a version of Slack in which you can actually have separate Slack workspaces for each subdivision, or geolocation, or other organizational unit within a larger enterprise, so that local teams have control over their channels, their preferences, and which integrations are added, but these are all part of an umbrella organization. And so, Wayne Enterprises can have a global announcements channel where the CEO can write out long messages that everybody in the company can see, but Wayne Security has their own control over their fiefdom. This means that each of those workspaces has its own footprint in the backend - its own database shards, its own MS shards. And so any given user no longer belongs to just one nice organizational bucket in the backend; we now have these different contexts to consider. Is this an organization-wide channel or is this a local channel?

And then things get even more interesting with this feature called shared channels that's currently in beta, where you can have two completely separate organizational entities with their own management, their own billing, and they can create a shared channel between them for collaboration. This is really heavily used by, for example, design agencies, or PR agencies, that want to have relationships with multiple customers, where those customers are also using Slack, and they can communicate within their home Slack or talk to people in the other company.

Well, this really fundamentally blew up a lot of the assumptions that the original architecture was written to service, because now channels no longer necessarily belong to one workspace. There are now sharing cases in which we have to consider: how do we adapt the architectural model to take into account both the bigger scale and some of these new shifts in the landscape?

So, like I said, that cross-workspace channel was one challenge. And then there was just dealing with the scale. You know, we had all the great war stories that Brandy alluded to, where an action would happen on a workspace with a large number of users, and all the client applications would come back and thunder into the backend to hammer it for some data. As we got to bigger and bigger organizations, that boot payload that I mentioned before got bigger and bigger, and so we were starting to have challenges just getting clients connected to the site. And then there's this herd-of-pets effect: it works fine when you have 20 servers, it starts to get bad when you have 100, and as you get into hundreds and hundreds of servers, just the manual toil it was taking to keep the site up was draining on our service engineering and operations teams.

So What Did We Do?

All right. So now let's turn to a little bit of what we did about it. And really I'm going to talk about three separate changes to the service that were in service of a number of these challenges. The first topic I'll talk about is a change to the client-server model, enabled by a caching service called Flannel, that allowed us to really slim down that initial boot payload. And so, if we take a little step back - I did some spelunking in our data warehouse. The original rtm.start payload was a function of the number of users in the workspace times how big their profiles were, plus how many channels there are times how big the topic, purpose, and other metadata of each channel is, plus how many people are in each channel times the size of a user ID. And I just picked three random teams - workspaces, sorry - with different numbers of users. And you see at about 4,000 users, this is a reasonable 7MB payload; it's not great. As you start to get up to tens of thousands or a hundred-and-something thousand users, the payload gets to be enormous. It gets to be basically prohibitively expensive for every laptop or iPhone to pull that whole payload down whenever a user wants to get onto the site. And let's remember that that's per user. So, if I have 100,000 users each pulling down 100-plus megabytes of data, that's just going to hammer the backend, and it's not really going to work.
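As a rough back-of-the-envelope version of that formula - the per-object byte sizes and ratios here are made-up constants for illustration, not measured Slack numbers - you can see how the membership and profile terms balloon as organizations grow:

```go
package main

import "fmt"

// Very rough, illustrative constants; real profile and channel sizes vary.
const (
	profileBytes         = 1000 // one user's profile, prefs, avatar URLs
	channelMetaBytes     = 500  // one channel's topic, purpose, metadata
	userIDBytes          = 10   // one membership entry
	avgChannelsPerUser   = 20
	avgMembersPerChannel = 50
)

// estimateBootPayload approximates the old rtm.start response size:
// users*profile + channels*metadata + memberships*userID.
func estimateBootPayload(users int) int {
	channels := users * avgChannelsPerUser / avgMembersPerChannel // crude channel count
	memberships := channels * avgMembersPerChannel
	return users*profileBytes + channels*channelMetaBytes + memberships*userIDBytes
}

func main() {
	for _, n := range []int{4000, 40000, 125000} {
		fmt.Printf("%6d users -> ~%d MB boot payload\n", n, estimateBootPayload(n)/1_000_000)
	}
}
```

Even with these toy constants, the payload grows from a few megabytes to well over a hundred as the user count climbs, which is the shape of the problem described above.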

And so the biggest change we made to address that was making that initial boot payload much, much smaller. We did that by introducing an edge caching service that we call Flannel, deployed in points of presence all over the globe, close to clients. And then we changed the WebSocket connection flow, so that after making an initial thin connection to the PHP application, we connect the WebSocket to a nearby Flannel cache, which proxies the connection to the message proxy and then back to that pub/sub message server.

So, as I mentioned, Flannel is a globally distributed edge cache. When clients connect, there's a routing tier in each edge point of presence that routes a client, with workspace affinity, to a Flannel cache that is likely to have the model for that workspace. We didn't want to get into a situation in which every Flannel had to have all the metadata about every customer in Slack, because that clearly doesn't scale either. So you find a Flannel that's nearby and that has the model for your workspace.

Then, because it's a proxy for that WebSocket connection, it's actually inline and able to man-in-the-middle the updates that occur for the users in that workspace. And so that allowed us to take all of those user profiles and all of that channel information out of the initial boot payload. The things that were making that boot payload huge, we just removed altogether, and then we exposed a callback query API so that clients could, with very low latency, fetch unknown information from a nearby Flannel.

And so that allowed us to maintain that really real-time feel to the product, where every bit of information you see on the screen shows up immediately and feels like it's instantly there, but without having to download all of it to the client to be pre-staged. We have this really rapid query/response API to an edge-located Flannel, very close to the users, that is able to service those queries. And this was hugely successful in allowing us to scale the product to these larger and larger organizations.
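A minimal sketch of that lazy-loading pattern on the client side might look like the following. The queryFlannel function is a hypothetical stand-in for the real callback query API (whose endpoints and message shapes aren't reproduced here): render with what's cached locally, and fetch unknown objects from the nearby edge cache only when they're first referenced.

```go
package main

import "fmt"

// UserProfile is a trimmed-down stand-in for the metadata a client renders.
type UserProfile struct {
	ID, DisplayName string
}

// queryFlannel stands in for the real low-latency callback query API to a
// nearby Flannel cache; here it just returns a canned response.
func queryFlannel(userID string) (UserProfile, error) {
	return UserProfile{ID: userID, DisplayName: "fetched-from-edge"}, nil
}

// Client keeps only the objects it has actually needed so far, instead of
// the entire workspace model from the old rtm.start payload.
type Client struct {
	users map[string]UserProfile
}

func (c *Client) userFor(id string) (UserProfile, error) {
	if u, ok := c.users[id]; ok {
		return u, nil // cache hit: feels instant, nothing downloaded up front
	}
	u, err := queryFlannel(id) // cache miss: quick round trip to the edge
	if err != nil {
		return UserProfile{}, err
	}
	c.users[id] = u
	return u, nil
}

func main() {
	c := &Client{users: map[string]UserProfile{}}
	u, _ := c.userFor("U123")
	fmt.Println(u.DisplayName)
}
```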

Oh, I wanted to mention one thing back here that I thought was kind of interesting. Rolling out Flannel was both a backend server refactor and a huge client rewrite. The clients had been written from day one to expect that the whole model was resident in memory. And so, if you think about taking a large JavaScript, Swift, or Android code base, now they have to be prepared to make callbacks if an object doesn't exist. And so we did this in phases. It obviously took cooperation, a huge uplift from the client teams and the Flannel team, to enable this.

One little interesting approach we took as part of the initial rollout was this: because Flannel sat as a man in the middle between the backend and the frontend, and it had this in-memory model of all the users, it initially had a mode where it would sniff the WebSocket messages that were coming across the wire to a given client, and if it saw a user ID or a channel ID that it thought the client didn't have, it would just inject that metadata inline into the WebSocket stream. So it was already in the client's model right before the client would want it, while we went through this long slog of rewriting the clients to expect objects to not necessarily exist and to be able to use this callback API. But with these techniques in place and the long, hard work of all the engineers, we were able to build this Flannel-based metadata model that really enabled the site to scale to these larger and larger customers and organizations.

Right. So the next big change we made was to the database tier - moving from the WebSocket stack to the backend. And this was a really big, fundamental rethinking of how we wanted to do our database sharding, stemming mostly from the challenges we had with continued database hotspots due to co-locating all of the users and all the channels for an organization onto a single database shard. I just pulled up some of the redacted postmortem titles from our GitHub postmortem repository, where shards were being overwhelmed with queries, or new features would get rolled out that were not always load tested as well as they could have been. And so the load would come in and it would just hammer these database shards.

The graph on the left is actually the top 100 databases in terms of daily query time, which I just pulled from a random day back in history. And you see, on that particular day, one shard was, I don't know, 5 to 10 times as busy as all the rest of them. And this would just keep happening. Someone would do something on an organization that we didn't expect. They'd provision a whole bunch of new users, or delete a bunch of channels; these unexpected feedback loops would occur, where all the clients would react to that and send all that load to the backend, and it would just overwhelm our servers.

And so we knew we needed to do something about this. And the observation we had at the time was that this workspace scoping was really useful for keeping things nice and tidy when you had lots of small organizations using Slack, because the load would spread out. But as we started to get bigger and bigger single organizations, that became a challenge, because we were funneling all of the activity from all those users and all those channels down to one database shard. And so what we wanted to do was change our sharding approach so that we could shard by much more fine-grained objects - either users or channels - and be able to spread that load out more efficiently across our big fleet of database backends.

And so, in order to do that, we introduced a new database tier and a new database technology to our system, called Vitess. Vitess was rolled out as a second, fully-featured data pipeline for the application to use. The way it works is that it runs MySQL at the core; it is a sharding and topology management solution that sits on top of MySQL. In Vitess, the application connects to a routing tier called VTGate. VTGate knows the set of tables that are in the configuration, it knows which set of backend servers are hosting those tables, and it knows which column a given table is sharded by.

And so that allowed us to configure the system such that user tables would be sharded by user ID, channel tables by channel ID, and workspace tables by workspace ID. And all of that knowledge is no longer in our PHP code; that knowledge has been moved out of the application. From the PHP code's standpoint, it queries VTGate using the MySQL protocol as if there were one giant database that has all the tables, and VTGate is responsible for doing all of the routing. Where necessary, it can scatter and gather requests across multiple shards, if a given query doesn't necessarily route to one shard. And so that allowed us to efficiently and effectively roll out the policy we wanted, where tables would be spread among multiple servers based on the particulars of each table, without needing to do a bunch of application work.
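Because VTGate speaks the MySQL wire protocol, the application-side change is deliberately small. Here is a hedged sketch in Go using the standard database/sql package with the common go-sql-driver/mysql driver; the DSN, keyspace, and table/column names are placeholders rather than Slack's actual configuration. The query looks like it's hitting one big database, and VTGate uses the sharding column in the WHERE clause to route it to the right shard.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // VTGate speaks the MySQL wire protocol
)

func main() {
	// Placeholder DSN: point at a VTGate instance instead of a MySQL host.
	db, err := sql.Open("mysql", "app_user:app_pass@tcp(vtgate.example.internal:3306)/slack")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The application writes a plain query against what looks like one giant
	// database. If the hypothetical `users` table is sharded by user_id,
	// VTGate can route this to a single shard; a query without the sharding
	// column would instead be scattered to all shards and gathered.
	var name string
	err = db.QueryRow("SELECT display_name FROM users WHERE user_id = ?", "U123").Scan(&name)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(name)
}
```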

Now, along the way, Vitess also improved our consistency story and our topology management a lot. That idea of active/active, master/master, which had served us really well, no longer applies here. In Vitess, we have a single writable master, and we rely on Orchestrator, an open-source project out of GitHub, to manage failover. And one thing I should mention: Vitess was originally invented at YouTube as their approach to sharding MySQL and scaling that backend, and we've taken that open-source project and adopted it for Slack's use.

Now, with Vitess in place, we've been able to really effectively scale that data tier. We now get the best of both worlds: we are still using MySQL, which we have a lot of know-how in and a lot of confidence in, but we've outsourced the topology management to the Vitess system, along with the sharding decisions and how data is allocated. That enabled us to make these transitions and thereby eliminate a lot of the hotspots that we had in our databases.

Now, one caveat about this that I wanted to mention is that the migration from our existing system to Vitess has been a little bit slower than I would've liked, actually. In many ways, this is somewhat fundamental, because we're changing not only how the data is stored, but the particulars of how the application wants to use that data. We've just run into the need to go very slowly and carefully to maintain the availability of the site. And so, although we are underway and Vitess has definitely proven to be a really solid database platform to build upon, I think many of us had hopes that this channel sharding project might've gone a little bit faster. We're making good progress, and I have high confidence that it will succeed. This is actually the project I've been leading the most at Slack, so I have a lot of personal stake in it. And it has been successful, like I said, in eliminating a lot of those hotspots that we had before.

So, I know I promised that this talk was not about a transition to microservices, and I only slightly lied, because there is a part of our stack that needed to change, and that is the real-time messaging service. This was really motivated by the need to handle these shared channels. The original message server architecture was fundamentally built around the idea that a single message server had all of the data for a given workspace, and so it managed all of the pub/sub interactions for every channel and all of the metadata updates for every user. That just isn't going to fly in a model where a channel no longer belongs solely to one workspace. So we needed to do something about it. We considered a lot of options. We could have replicated the shared channel data between the cooperating message servers. But the team got together and decided that this was actually a case in which decomposing this monolith into a bunch of services did make sense.

And so in this model, we've refactored that message server system into five main cooperating services. I feel a little bit like I can't call this microservices when there are only five of them, but there are five, and they each serve a separate purpose. And this is how they fit into the architecture. Users connect their WebSocket to a process that we now call the Gateway Server; this replaces that message proxy box that I had up there before. The Gateway Server manages subscriptions to a number of Channel Servers that are responsible for the core publish/subscribe system. An administrative service tier is responsible for maintaining the cluster topology. And then there's a separate Presence Server tier that's really just responsible for distributing presence information. Each of these services has its own bespoke role. It turns out we had to keep that legacy message server tier around; I'll talk about that in a little bit. There's a feature whereby we could schedule messages for broadcast in the future, used by some of our integrations, that turned out to not have a great home in our service decomposition. So we just kept that in the old tier, but we're running many, many fewer of them, and they're much less critical to the operation of the site.

And so, with this refactoring, we were able to change the fundamental model of this message server system by decomposing things not by workspace, but by channel. The overall pub/sub system is now a very generic system in which everything is a channel - and that's not just the channels that contain messages in Slack. Every user has their own channel ID, so if you make an update to your profile, that gets fanned out according to a subscription for that user ID. Every workspace has profile information that has its own fan-out for cross-workspace information. And so the service has become much more generic and therefore much simpler to reason about. Clients can subscribe to the objects of their interest using this much more generic, decomposed system.
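To illustrate that "everything is a channel" model - with invented IDs and a toy in-memory broker rather than the real channel and gateway servers - a client's subscription set just becomes a list of channel IDs, some of which happen to represent users or workspaces:

```go
package main

import "fmt"

// Broker is a toy stand-in for the channel server tier: a generic pub/sub
// keyed only by channel ID, with no notion of workspaces.
type Broker struct {
	subs map[string][]chan string // channelID -> subscriber streams
}

func NewBroker() *Broker { return &Broker{subs: map[string][]chan string{}} }

func (b *Broker) Subscribe(channelID string) chan string {
	ch := make(chan string, 16)
	b.subs[channelID] = append(b.subs[channelID], ch)
	return ch
}

func (b *Broker) Publish(channelID, event string) {
	for _, ch := range b.subs[channelID] {
		ch <- event // fan out to everyone subscribed to this ID
	}
}

func main() {
	b := NewBroker()

	// A client subscribes to the objects it cares about: a message channel,
	// another user's profile "channel", and a workspace-wide channel.
	// The broker treats them all identically. (IDs are hypothetical.)
	msgs := b.Subscribe("C42")
	profile := b.Subscribe("U123")
	wide := b.Subscribe("T999")

	b.Publish("C42", "new message")
	b.Publish("U123", "avatar changed")
	b.Publish("T999", "workspace renamed")

	fmt.Println(<-msgs, "|", <-profile, "|", <-wide)
}
```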

And so, fundamentally, this has really let us adapt our real-time message system to accommodate some of these shared channel use cases. Again, just like before, we subscribe to updates for all the relevant objects, and we've decomposed the service into clear responsibilities. To echo those comments from before: this was a microservice refactor in service of the computer science, not of developer velocity. It's actually the same development team that works on all of these; the goal of the refactoring was to enable the architecture to better meet the needs of the site, as opposed to anything around independent development.

Now, one thing that emerged from this is that a given user no longer depends on just one message server being available. In fact, I as a user might be subscribed to tens or hundreds of channels, plus hundreds more user channels. And so one consequence was that failures in the service ended up being much more widespread, and we've had to design some additional systems to ring-fence some of these cascading failures, because now each user is dependent on many more of the backend servers. We lost a little bit of the failure isolation that we had very naturally in the workspace sharding model. I'll talk about that a little bit later when we get to some of the themes in these changes. But overall, these three fundamental changes to the Slack architecture have really helped us and enabled us to scale to the needs of our site right now, both from the standpoint of scale and from the standpoint of the user requirements on the product.

Some Themes

So, looking at these three things, there were a few aspects that I thought would be worth calling out as cross-cutting themes across these changes. The first is that we've really moved away from this concept of a herd of pets. In all of these new systems, we no longer have individually named servers that are replaced by humans when they fail. Instead, the services that contribute to this overall architecture self-register in a service mesh, and other services are responsible for discovering them. That's really enabled our operations team to have much less manual toil. On the other hand, we now have a much harder dependency on our service discovery and distribution system. We use Consul. No comment. We've had some challenges with Consul, though we're working through them. But really, it's been a big change in the way that we think about the dependency tree for the site; it now leans much more heavily on this topology management and service discovery system.

And then Holly Allen, in her talk yesterday about on-call at Slack, really touched on this third bullet. If you didn't see her talk, I encourage you to watch the video. When we introduced these new architectural components, as the engineering team grew, it actually was a big contributing factor to how we had to change the concept of service ownership and on-call at Slack. Before, those five boxes were really easy to think about, so anybody could jump in and debug and diagnose the system. With all of this refactoring and these more complicated services, we actually needed much more specialized knowledge to isolate, debug, and diagnose problems.

And so the engineering culture and the concept of service ownership really followed the architecture and design. It's almost the inverse of Conway's law: we did this thing, and then we needed to adapt our organizational design to match the architecture, instead of vice versa.

Now, I mentioned before how this fine-grained sharding really did help; it alleviated a ton of the load on our database servers. But we really did start to run into challenges when there were operations that no longer routed to a single shard. I alluded to this before with the subscriptions that a given user has for all of the channels and all of the users in their organization, but it applies to the database tier as well. There are certain cases in which you actually do need to get information about a whole bunch of channels. For example, rendering that sidebar when you log in is a complicated thing for Slack to do. You have to figure out all the channels you're in, the last message you read in each channel, and the last message that was written in each channel. And there's no right way to shard that in a way that's both scoped to you (what am I subscribed to?) and scoped to the channel (what has everyone written in that channel?). And so we've had to adapt the site and the service to accommodate the fact that, as part of login, I might need to get a whole bunch of data from a bunch of different shards.

One other thing we've done for that is relaxing some of the consistency requirements and adapting the clients to accept partial results. That means that when the client goes to fetch that information, we can bound how much time we're willing to wait for any one shard. If a shard exceeds that threshold, we return whatever information we can to the client and trust that it will call back again later to fill in the gaps for the missing information. And so this theme of being prepared for potential unavailability has really helped us accommodate that scatter/gather use case without making every client and every action depend on every backend server, which is pretty bad for resiliency and reliability, and also for performance, because the slowest shard is always the slowest shard. If I have to talk to 100 shards, then all of my operations are as slow as the slowest of those 100.
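A simplified sketch of that bounded scatter/gather - with a fake per-shard fetch and an arbitrary timeout rather than whatever thresholds the real service uses - looks like this: fan out to every shard, stop waiting at a deadline, and hand back whatever arrived, trusting the client to fill in the gaps later.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// fetchFromShard stands in for a per-shard query; some shards are slow.
func fetchFromShard(ctx context.Context, shard int) (string, error) {
	delay := time.Duration(rand.Intn(200)) * time.Millisecond
	select {
	case <-time.After(delay):
		return fmt.Sprintf("rows-from-shard-%d", shard), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// scatterGather asks every shard, but only waits up to the deadline and
// returns whatever partial results made it back in time.
func scatterGather(shards []int, deadline time.Duration) []string {
	ctx, cancel := context.WithTimeout(context.Background(), deadline)
	defer cancel()

	type reply struct {
		data string
		err  error
	}
	out := make(chan reply, len(shards))
	for _, s := range shards {
		go func(s int) {
			d, err := fetchFromShard(ctx, s)
			out <- reply{d, err}
		}(s)
	}

	var results []string
	for range shards {
		r := <-out
		if r.err == nil {
			results = append(results, r.data)
		}
		// Slow or failed shards are simply skipped; the client is expected
		// to call back later to fill in the missing pieces.
	}
	return results
}

func main() {
	got := scatterGather([]int{1, 2, 3, 4, 5}, 100*time.Millisecond)
	fmt.Printf("returned %d of 5 shard results before the deadline\n", len(got))
}
```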

Now, one thing that hadn't really occurred to me before making this slide deck is just how few boxes I actually removed from the architectural diagram that I showed you before. With only one exception, everything that was on there in 2016 is actually still part of production Slack today. There are a bunch of reasons for this. Legacy clients are one: there are clients that still connect not through the Flannel path, because they're either a third-party integration or some old Windows Phone that's sitting out there somewhere that we still support. We do support Windows Phone; I'm not joking.

And then I mentioned this before about the database migrations: those just take time. We have terabytes and terabytes of data, and high QPS. And so we have to make sure that migration can be done safely, where you're trying to both reshard the data and change the data model. In many cases, we've suffered a little bit from second-system effect, where we have a table that works one way, and we have a long tail of things we've always wanted to do on that table - boy, wouldn't it be nice to couple those new features with the data migration we have to do right now anyway? And so that's actually caused a little bit of tension in how aggressively we move the data off those legacy shards into this new system.

The other thing that was really important for us to figure out is that, while these grand architectural refactors are really useful - they were essential for us to be able to manage the growth of the site - there have also been hundreds of little, super-critical, not-so-glamorous performance changes in the backend systems as we've seen various hotspots come through. So, again, it's true that we needed a lot of these big refactors, but things like adding additional caching and removing expensive database queries have been hugely important to getting us through this parade of "Oh my God, things are terrible" to "Oh, great, somebody fixed it, yay."

One pattern that has been really important to us is adding jitter to our client operations to avoid some of these thundering herds. There are those of us who have described the basic Slack client as a nicely dormant command-and-control botnet, because it is sitting out there with this very efficient pub/sub system just waiting to take commands from the backend. And sometimes those commands cause it to turn around and DOS our own system. And so one of the things we needed to do was really look at our operations and think: if we did this for 100,000 or millions of people at a time, what would it actually do? And then spread those operations out over time by adding simple jitter, or delays, or by deferring actions that you might or might not have to do. That kind of cooperation between the client side and the server side has, for us at least, proven really essential to keeping the site running as we've been making these bigger and more sweeping changes to the overall architecture.
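The jitter pattern itself is tiny. Here is a minimal sketch - the base delay, the spread, and the function names are arbitrary illustrations, not Slack's actual tuning or API - in which a client, before reacting to a broadcast command, sleeps a random slice of a window so that 100,000 clients don't all call home in the same second.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredDelay spreads a fleet of clients over a window instead of letting
// them all react to a broadcast at the same instant.
func jitteredDelay(base, spread time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(spread)))
}

// onReconfigurePush is what a client might do when the backend tells every
// connected client to refresh some data (names are illustrative).
func onReconfigurePush(refresh func()) {
	d := jitteredDelay(1*time.Second, 30*time.Second)
	fmt.Println("deferring refresh by", d.Round(time.Second))
	time.AfterFunc(d, refresh) // the thundering herd becomes a trickle
}

func main() {
	onReconfigurePush(func() { fmt.Println("refreshed") })
	time.Sleep(35 * time.Second) // keep the toy program alive long enough
}
```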

How Slack Works (2018)

And so, in general, that was the picture we started with, and this is the nice, happy picture of where we are now. It's more complicated for sure, but many of - in fact, all of - these changes were super important for us to be able to both meet the big scale demands of our larger customers and accommodate these new shared channel use cases that complicated the model.

And we're not done yet. I thought it'd be worth talking about some of the things that we haven't done, and some of the things that maybe future Slack employees might talk about in QCon talks coming down the pipeline. I felt like I had to put up this "decompose the monolith into services" item. We do still have a giant PHP monolith, and we suffer from some of the challenges of a large engineering organization working on that monolithic code. There have been some efforts in thinking about how we want to decompose that, much along the lines of what the great folks who preceded me in this room talked about.

At some point, we will be on multiple storage backends; we'll need to figure that out. That asynchronous job queue - I wish we had some time to work on that; there's just some really good room for improvement there. We'll keep working on the scale and resiliency features of the product. And then this whole vector of eventual consistency, I think, is going to be super important to the future scaling of Slack. The more we can adapt the client so that it still has that real-time feel for the things that really need to be real-time, but is able to tolerate some latency or some lag in the backend APIs, the better positioned we'll be to continue the growth that we've had so far.

So, I hope this has been interesting. I think I have a lot more time for questions than I thought I would when I practiced, so, thanks so much.


Recorded at:

Dec 18, 2018
