
Josh Wills on Building Resilient Data Engineering and Machine Learning Products at Slack

Josh Wills, a software engineer working on data engineering problems at Slack, discusses the Slack data architecture and how they build and observe their pipelines. Along with color commentary, such as his move from IC to manager (and back), Josh shares recommendations, tips, tools, and lessons Slack engineering teams discovered while building products like Slack search. The podcast covers machine learning, observability, data engineering, and general practices for building highly resilient software.

Key Takeaways

  • Slack has a philosophy of building only what they need. They have a "don't reinvent the wheel" mindset.
  • Slack was originally a PHP monolith. Today, it is largely Hack and HHVM, plus several Java and Go binaries. On the data side, application logs are in Thrift (there is a plan to migrate to Protobuf). Events are processed through a Kafka cluster that handles hundreds of thousands of events per second. Everything is kept in S3 with a large Hive metastore, and EMR clusters are spun up on demand. Presto, Airflow, Spark, Snowflake (business analytics), and Quiver (key-value store) are all used.
  • ML worked best for Slack when it was used to help people answer questions. Learning to Rank (LTR) became the most effective use of ML for Slack.
  • You can get pretty far with rules. Use machine learning only when rules are exhausted.
  • When Slack started applying observability to their data pipelines, a key lesson was to focus on structured data, tracing, and high-cardinality events. This let them use the tools they were already familiar with (ELK, Prometheus, Grafana) and go deep into understanding what is happening in their systems.

Show Notes

You said that you recently switched back from management to an individual contributor. Tell us about that. -

  • 01:45 I've done technical work for most of my career, over 18 years, and the majority of the time I was doing lead roles.
  • 02:00 When I was at Cloudera, I was the director of data science, but it was largely a DevRel sort of role.
  • 02:05 DevRel is the easiest management job in the world.
  • 02:10 Management is about two things: getting stuff done, and developing your people.
  • 02:20 In DevRel I found the management was relatively easy, because in a sense developing people is the main purpose of the role.
  • 02:30 I wanted every evangelist who worked for me to become a famous and important developer relations person.
  • 02:40 Traditional engineering management has a lot of tedious work that has to get done.
  • 02:50 I would like to give everyone fun stuff to work on, but that's not always possible when building a company or team.
  • 03:00 I wanted to try my hand at Slack doing real management, where developing people isn't the obvious alignment of what you had to get done.
  • 03:15 I think that I wasn't very good at it for a bunch of different reasons.
  • 03:20 One of them was that spending a couple of years at Cloudera doing talks, tweets and blog posts ruined me for doing any kind of real job.
  • 03:30 I got accustomed to doing more or less whatever I wanted whenever I wanted, using my own judgement as to what was interesting and what wasn't.
  • 03:45 To a certain extent, real management is about finding out what your boss is interested in and wants to make happen.
  • 03:50 I am not particularly bothered about that - in different domains, I have an idea of what we should do, so I think we should go with that.
  • 04:00 That isn't always what everyone else thinks is best.
  • 04:05 I can be wrong as well, but in open source projects I'm a fan of the benevolent dictator model, in terms of efficiency.
  • 04:15 But the TL;DR is that it didn't make me a great manager; my reports liked me, but I'm not sure that other managers did.
  • 04:30 I can understand the instinct to move into management after being an individual contributor for so long.
  • 04:45 I started management just after the birth of my first child, thinking that with a new kid I wasn't going to be sleeping much anyway, and wouldn't have the focus for technical work.
  • 05:00 In my individual contributor role, I was always fairly brain dead in meetings, so I thought I could be brain dead all the time.

What was your talk at QCon.AI about? -

  • 05:40 When I started working on search at Slack, and we wanted to use ML to improve our search ranking, we had to start building production services.
  • 05:55 I had spent most of my career working on small modules, or doing large scale off-line systems like Hadoop.
  • 06:05 I hadn't worked on production software that had to be up continuously.
  • 06:10 Working with the monitoring and visibility teams, and getting exposed to their tooling and processes, was new to me.
  • 06:25 I thought it was quite interesting, and that other people with my background would find it useful.

What is the architecture like at Slack; is it mainly a monolith? -

  • 07:25 When I joined in 2015, there was an introduction on my first day called "Slack mostly works."
  • 07:30 In an hour, they explained the entire Slack architecture.
  • 07:40 Back in the day, there was a PHP monolith, much as Flickr had been.

Slack started off about eight years ago as a games communications company? -

  • 07:55 The game was back in 2009, but it was core to the success of the company.
  • 08:00 Most of the work that Slack does is the chat system and message server.
  • 08:10 It used to be a giant Java binary; now it is several Java binaries, Go services, a bunch of different services talking to each other.
  • 08:15 Back when I joined in 2015, a Slack team could get to about 4,000 users before things got unpleasant.
  • 08:25 Nowadays, our largest Slack team has hundreds of thousands of users on the same Slack instance.
  • 08:30 The PHP monolith still exists, but it has largely been replaced with Hack and HHVM; we and Facebook are the two big Hack users in the world.
  • 08:35 We used to shard our database simply per team - not rocket science - and now we use Vitess (YouTube's MySQL sharding system) to shard in different ways (a minimal sketch of the per-team idea follows this list).
  • 08:50 Where it made sense, we carved out parts of the monolith into dedicated services, written in Go or Java, for doing application-aware object caching, searching, etc.
  • 09:15 There's a good article by Shopify, in which they talk about refactoring their Ruby on Rails monolith to modularise and scale into independent systems.
  • 09:25 We are undergoing that same process right now, without stopping the world or building our own data center from scratch.
  • 09:40 Conceptually it's the same, but every piece of the system has got more complicated, and instead of one big monolith it's a dozen different services.
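
To make the "shard per team" idea concrete, here is a minimal, hypothetical sketch - not Slack's actual routing code - of deterministically mapping a team to one of a fixed set of MySQL shards:

```python
# Hypothetical sketch of "shard by team": every team_id maps
# deterministically to one of N MySQL shards, so all of a team's rows
# live together. Host names and shard count are invented.
import hashlib

SHARDS = [f"mysql-shard-{i}.internal" for i in range(16)]  # hypothetical hosts

def shard_for_team(team_id: str) -> str:
    """Route a team to a shard with a stable hash (not Python's built-in
    hash(), which is randomized per process)."""
    digest = hashlib.md5(team_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for_team("T024BE7LD"))  # always the same host for this team
```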

What does the data side look like? -

  • 09:55 Our data stack is still largely pretty straightforward - it's still based on the system built in 2015.
  • 10:05 Our application logs are written in Thrift, though we're trying to migrate those over to Protobuf, as one does.
  • 10:10 Everything goes through a big Kafka cluster that does hundreds of thousands of events per second.
  • 10:15 We keep everything in S3, we have a central metadata store that tracks what the schemas are, and we spin up EMR (Elastic MapReduce) clusters from Amazon when we need them (a sketch of this landing path follows this list).
  • 10:25 We use a lot of Presto, Airflow, Spark, Hive, and Snowflake - which has been useful for the business analytics side of things.
  • 10:35 We have adopted a key-value store from Foursquare called Quiver - it's open source, but we've taken it in house and tweaked it a little bit.
  • 10:45 We use it for publishing key-value-oriented data from our data warehouse into our production systems, so we can serve models for spam, search ranking, and statistics and analytics.
  • 10:55 Slack is not a reinvent-the-wheel kind of place; I generally like that.
  • 11:05 We don't build stuff for fun - we build it because we have to, and when we do, we go kicking and screaming.
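
As a rough illustration of the Kafka-to-S3 landing path described above, here is a minimal sketch assuming the kafka-python and boto3 libraries; the topic, brokers, bucket, and batch size are all hypothetical:

```python
# Minimal sketch: consume events from Kafka and land them in S3 in
# batches. All names (topic, brokers, bucket) are hypothetical.
import datetime
import json

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "application-events",                      # hypothetical topic
    bootstrap_servers=["kafka:9092"],          # hypothetical brokers
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 10_000:                   # flush in chunks, not per event
        today = datetime.date.today().isoformat()
        key = f"events/dt={today}/offset-{message.offset}.json"
        body = "\n".join(json.dumps(e) for e in batch)
        s3.put_object(Bucket="event-lake", Key=key, Body=body.encode("utf-8"))
        batch.clear()
```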

How do models get deployed in Slack? -

  • 11:35 It's mostly the third side, by a wide margin - that's where most of the heavy-duty machine learning is done.
  • 11:40 At its core, sending messages back and forth is the thing we do very well - much better than anyone else.
  • 11:55 The game was the core of the company, and became the message server industrial complex.

When did you discover you needed to do machine learning? -

  • 12:10 There are two different things at play.
  • 12:15 Firstly, we had done modelling in some sense for a long time - there was a highlights product (now deprecated) we launched a few years ago.
  • 12:30 We were concerned about information overload, so we tried to attack that problem with machine learning.
  • 12:40 We didn't get far with using it for that; it would be glib of me to say it, but everyone has their own personal preference for what they consider information overload.
  • 12:50 They don't really want a computer telling them what is and isn't information they need to be considering.
  • 13:00 There is a notion that people want their computer to be their chief of staff, which filters and routes information to you, and only shows you things that are interesting.
  • 13:10 I know people who have their own human chiefs of staff, and I don't know that they're comfortable with getting their information filtered through someone else.
  • 13:15 I don't want a chief of staff; I want a customer service agent that can find (say) everything that Josh Wills has said about Spark - what's the FAQ?
  • 13:40 Like every well run company, we go with what our customers tell us, and machine learning came from our search ranking needing improvement.

So search ranking was the entry point for machine learning in Slack? -

  • 14:05 It was the lowest-hanging fruit: a way to bring ML in and improve search results without too much trouble.
  • 14:15 With ML these days, we can do anything we want if we have the data; the hard part is getting the data and figuring out what the product should do.

How did you structure the data to get to the first ML product? -

  • 14:35 The first question is: what are you trying to achieve?
  • 14:40 The answer was that we wanted the search results to be good and relevant to the query the user gave.
  • 14:45 So, how do we know what is relevant information?
  • 14:50 We aren't Google, and we don't have a link structure for Slack teams to analyse, so we can't build PageRank per se.
  • 15:00 We have to go with user behaviour; what gives an indication that the search has been good?
  • 15:05 If the answer you want is right there on the page, you just exit the search.
  • 15:10 Does that mean that you found what you were looking for, or does it mean that you didn't find anything and just gave up?
  • 15:20 We can't see your messages, and we can't see your queries, so this is abstract behaviour for us.
  • 15:25 We have no idea what your query is, we have no idea what the content of your messages is - we have to rely on behaviour.
  • 15:30 By a wide margin, the most positive quality signal we get is when someone shares a search result into a channel.
  • 15:45 That's the best possible world for us, so we train against clicks, scrolls, and discount factors, but sharing the result is our north star.
  • 16:05 Once we have that, all the tools - logistic regression, XGBoost, deep neural networks - are just there for you, and you can go to work (a minimal training sketch follows this list).
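
To show roughly what "go to work" looks like, here is a minimal learning-to-rank sketch in XGBoost's Python API, with "shared into a channel" as the strongest positive label. The features, label grading, and hyperparameters are hypothetical, not Slack's:

```python
# Learning-to-rank sketch: one row per (query, candidate result) pair,
# graded labels, pairwise ranking objective. All values are invented.
import numpy as np
import xgboost as xgb

# Three hypothetical features per candidate (e.g. text match, recency, ...).
X = np.array([[0.9, 12, 1], [0.4, 3, 0], [0.7, 8, 1],
              [0.2, 1, 0], [0.8, 20, 1], [0.1, 0, 0]], dtype=float)
# Graded relevance: 2 = shared into a channel, 1 = clicked, 0 = ignored.
y = np.array([2, 0, 1, 0, 2, 0], dtype=float)
# Rows 0-2 belong to query A, rows 3-5 to query B.
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([3, 3])

model = xgb.train(
    {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 4},
    dtrain,
    num_boost_round=50,
)
print(model.predict(xgb.DMatrix(X)))  # higher score = rank higher
```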

What does the process look like for getting the trained model into production? -

  • 16:35 We typically train models in Spark (a Java/Scala-based system), but our monolith was based on PHP.
  • 16:55 We tried to build an XGBoost binding for PHP via C++ and JNI/Java, but we ultimately thought it would be easier to have a pure Java service.
  • 17:15 At that point, deploying the model became uploading the model to S3, loading it into the service, and you were off (a sketch of this pattern follows this list).
  • 17:25 The hardest part about this was building the Java service - for better or for worse, Slack has a reputation for not being the most reliable/stable service.
  • 17:35 Some of the most unpleasant times in my life have been being on Twitter when Slack is down and seeing all the messages.
  • 17:50 A year ago, we got very serious about reliability; where we had been prioritising moving fast and iterating as swiftly as possible, we started prioritising slowing down and being up and stable.
  • 18:05 If we had not had that push, I would not have discovered any of this monitoring stuff, because the Java binary would just be up and running.
  • 18:25 We started with a library called Armeria [https://twitter.com/armeria_project], written at LINE Corp by Trustin Lee, of Netty and Twitter fame.
  • 18:55 It's a really good library which comes with logging, Thrift, gRPC, and HTTP, and it's fast and efficient.
  • 19:05 We started building up our Java stuff on top of that, which was what exposed me to this world.
  • 19:15 The enormous amount of work we had to do in order to serve an XGBoost model in production was humbling and educational.
  • 19:25 Serving it is easy; serving it reliably was the problem.
  • 19:35 It came with Prometheus and logging out of the box.
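
The service described above is Java on Armeria; as a language-neutral illustration of the same deploy pattern, here is a minimal Python sketch that pulls a model artifact from S3 at startup, loads it, and serves scores over HTTP. The bucket, key, and port are hypothetical:

```python
# Sketch of the "upload model to S3, load it into the service" pattern.
# Bucket/key/port are hypothetical; the real service is Java on Armeria.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import boto3
import numpy as np
import xgboost as xgb

# Pull the model artifact once at startup, then load it.
boto3.client("s3").download_file("model-bucket", "search/ltr.model", "/tmp/ltr.model")
booster = xgb.Booster()
booster.load_model("/tmp/ltr.model")

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Request body: a JSON array of feature values for one candidate.
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        score = booster.predict(xgb.DMatrix(np.array([features], dtype=float)))[0]
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"score": float(score)}).encode("utf-8"))

HTTPServer(("", 8080), ScoreHandler).serve_forever()
```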

So you inherited the platform based on what you were using? -

  • 19:45 We had an existing development environment - Chef, Consul, Logstash.
  • 19:55 When we were building the Java service, choosing what to build on came down to a refined set of Google queries.
  • 20:00 I'm not looking for a student project; I'm looking for something that has been developed by serious people at serious companies at serious scale.

What do you instrument for a machine learning system? -

  • 20:30 You start with the obvious stuff: what's going to screw you?
  • 20:35 What are the inputs; what am I seeing? Do I expect a string, and what's its length likely to be?
  • 20:45 I'm generally loath to go deep into instrumenting the black box of the model itself, because I'm planning on swapping that thing out.
  • 21:00 The thing that generally burns me is that the inputs have changed or drifted substantially and the model is no longer fit for purpose.
  • 21:15 I feel it's the kind of thing that a developer takes for granted.
  • 21:20 With whatever we're calling Swagger now, or Thrift, or gRPC, you fully specify the API of what you're expecting to see.
  • 21:30 When you're feeding inputs into a model rather than spaghetti code you wrote yourself, you can't get that same level of detailed specification for the infinite-dimensional inputs of a machine learning problem.
  • 21:45 In theory, the query can be anything in some sense - any string expressible in Unicode could be a valid query, for instance (a small instrumentation sketch follows this list).
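
Here is a small sketch of instrumenting model *inputs* rather than the model internals, as described above, using the prometheus_client Python library. The metric names, buckets, and feature checks are hypothetical:

```python
# Instrument what goes INTO the model: input shapes and missing values,
# not the model's internals. Metric names are hypothetical.
from prometheus_client import Counter, Histogram, start_http_server

QUERY_LENGTH = Histogram(
    "search_query_length_chars",
    "Length of incoming search queries",
    buckets=(1, 5, 10, 25, 50, 100, 250),
)
EMPTY_FEATURES = Counter(
    "search_empty_feature_total",
    "Requests arriving with a missing or empty feature",
    ["feature"],
)

def observe_request(query: str, features: dict) -> None:
    QUERY_LENGTH.observe(len(query))
    for name, value in features.items():
        if value is None or value == "":
            EMPTY_FEATURES.labels(feature=name).inc()

start_http_server(9100)  # expose /metrics for Prometheus to scrape
```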

So you're just logging standard information like you might do for any system, machine learning or not? -

  • 22:05 Exactly - it was something that I took for granted as an API developer.
  • 22:15 It's how we test in production - you can't possibly test every input, or auto-generate the validation code for it.
  • 22:40 There are some companies talking about how we could do machine learning specific observability tools.
  • 22:45 I'm not there yet; the tools we have seem pretty good to me.
  • 22:50 If you're developing a new tool just because you haven't bothered to get familiar with existing tools, we're back in the world of reinventing the wheel.

Do you have to calibrate your thinking to align with machine learning? -

  • 23:15 In theory no, in practice yes.
  • 23:25 I have always had a religion around structured data and structured logs.
  • 23:35 I have really got a religion around tracing and spans and that level of structure.
  • 23:40 Leveraging tools like Logstash and Prometheus is great.
  • 23:50 We use Honeycomb here; we use traces and spans.
  • 24:00 High-cardinality events, where I can generate a unique identifier for every single request and trace it through the entire system - from my key-value store to my web server and ranking server and Solr cluster and back again - are absolutely amazing.
  • 24:15 I feel like I am getting the speed of Grafana and Prometheus data, and I'm getting it quickly, with the level of deep dive that I can get from Logstash.
  • 24:30 I had the Spotify infra team in here for lunch one day, and I was asking them what their Logstash cluster looked like.
  • 24:40 They said it was a total disaster and dumpster fire, because you can throw anything in there and that's what you do.
  • 24:45 If you take the time to provide a little bit of structure, you can let the back-end query system do so much of the work for you, and optimise so much, that you can work and iterate so much faster.
  • 25:05 Being able to trace individual users, or a specific version of the software with a specific feature flag on, lets you dig into the data at the speed of Grafana (a sketch of one such structured event follows).
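
As an illustration of the "one wide, structured event per request" style described above, here is a minimal sketch that emits a single JSON log line with a request id that downstream tools (Honeycomb, the ELK stack) can join across services. The field names are hypothetical:

```python
# One structured event per request, keyed by a high-cardinality request id.
# Field names and values are hypothetical. Note: log the query LENGTH,
# never the query itself, matching the privacy constraint above.
import json
import sys
import time
import uuid

def handle_search(query: str, user_id: str, feature_flags: dict) -> None:
    request_id = str(uuid.uuid4())   # high-cardinality join key
    start = time.monotonic()
    # ... call the ranking service, key-value store, Solr, etc. here,
    # passing request_id along so each hop can log its own span ...
    event = {
        "request_id": request_id,
        "user_id": user_id,
        "build_version": "abc123",          # hypothetical build stamp
        "flags": feature_flags,
        "query_length": len(query),
        "duration_ms": (time.monotonic() - start) * 1000,
    }
    sys.stdout.write(json.dumps(event) + "\n")
```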

Any tips or tricks? -

  • 25:35 I'm the leading machine learning expert who tells people not to do machine learning.
  • 25:40 Exhaust other possibilities before turning to machine learning.
  • 25:45 We got pretty far with some simple rules before we needed to turn to machine learning.
  • 25:55 You can get a long way without machine learning (a hypothetical rules-only baseline follows this list).
  • 26:00 I don't feel that most machine learning practitioners know what is required to build a production service.
  • 26:25 A mentality shift is needed for hosting production services.
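
To ground "you can get pretty far with rules", here is a hypothetical hand-tuned scoring function of the kind you might ship before reaching for ML. The signals and weights are illustrative only, not Slack's:

```python
# A rules-only search ranking baseline: weighted sum of term overlap,
# recency, and channel locality. All weights and fields are invented.
from datetime import datetime, timezone

def rule_score(msg: dict, query_terms: set[str]) -> float:
    text_terms = set(msg["text"].lower().split())
    overlap = len(query_terms & text_terms) / max(len(query_terms), 1)
    age_days = (datetime.now(timezone.utc) - msg["ts"]).days
    recency = 1.0 / (1.0 + age_days)          # newer messages rank higher
    same_channel = 1.0 if msg["in_current_channel"] else 0.0
    return 3.0 * overlap + 1.0 * recency + 0.5 * same_channel
```

A baseline like this is also what the eventual ML model has to beat, which keeps the decision to adopt ML honest.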

What are the different rules in the 2015 ML test paper? -

  • 26:45 It was broken up into four different subsections of building machine learning models.
  • 26:50 There's your off-line data validation and testing, and then your off-line monitoring with counters or accumulators, much as Prometheus does on-line.
  • 27:00 Anomaly detection on your off-line modelling systems is incredibly valuable - almost as valuable as it is on your on-line systems.
  • 27:10 What I suggest is that you have a one-to-one mapping from concepts in on-line production monitoring to off-line monitoring of systems.
  • 27:20 A lot of the time it's just a case of re-naming things, but the principles apply in both places.
  • 27:25 The rubric is exhaustive and fantastic, generated by people who have been burned very badly by unfortunate machine-learning-driven outages of production systems.
  • 27:35 It's also so comprehensive that covering all fifty different checklist items is a lot, and I think the 80:20 rule applies here.
  • 27:45 I think when I retire, I will write some way of evaluating which ones you should do.
  • 28:00 If you don't have an infinite set of inputs, you don't actually have a machine learning problem.
  • 28:10 Once you acknowledge you have an infinite set of inputs, being methodical at every stage - testing, monitoring, off-line training, on-line serving - will pay off.
  • 28:30 The other thing the paper doesn't cover is how tightly coupled the off-line and on-line systems have to be.
  • 28:35 What I see is that you can do everything right off-line, but if you need six months for all the training and testing, the production world might have changed and the model may be useless (a small off-line validation sketch follows this list).
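
In the spirit of mirroring on-line checks off-line, here is a small hypothetical validation pass over a training batch; the thresholds and column names are invented for illustration:

```python
# Off-line data validation mirroring on-line checks: missing labels,
# input ranges, and drift in the key label rate. Thresholds are invented.
def validate_training_batch(rows: list[dict]) -> list[str]:
    problems = []
    if not rows:
        return ["empty batch"]
    null_labels = sum(1 for r in rows if r.get("label") is None)
    if null_labels / len(rows) > 0.01:
        problems.append(f"{null_labels} rows missing labels")
    for r in rows:
        qlen = r.get("query_length", 0)
        if not (0 < qlen <= 10_000):            # same bound the service enforces
            problems.append(f"query_length out of range: {qlen}")
            break
    share_rate = sum(r.get("shared", 0) for r in rows) / len(rows)
    if not (0.001 <= share_rate <= 0.2):        # drift check vs. historical norm
        problems.append(f"share rate drifted: {share_rate:.4f}")
    return problems
```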

What are you going to be doing? -

  • 29:05 I'm leaving Slack at the start of November; I will be unemployed and be a dad for a while.
  • 29:15 I moved to San Francisco in 2007, and worked at Google, Cloudera and Slack.
  • 29:30 I'm pretty tired, and I'm looking forward to travelling and getting bored, and then figuring out what to do.

 


More about our podcasts

You can keep up to date with the podcasts via our RSS feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
