Driving Technology Transformation at @WeWork

Summary

Hugo Haas talks about the platform and architecture behind WeWork’s technology transformation over the past 2.5 years. He outlines some of the unique technology challenges WeWork faces – global systems across China and the rest of the world, hybrid infrastructure between the cloud and on-premise physical buildings, etc. – and describes in detail how WeWork is tackling them.

Bio

Hugo Haas is a Fellow Engineer working on WeWork’s Developer Platform, which provides development tooling as well as service and data infrastructure to all WeWork engineers. Prior to joining WeWork, he worked as a software architect on deployment systems at Salesforce, the 2015 Flickr relaunch, and also led a re-architecture of all of Yahoo!’s media properties.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

[Note: please be advised that this transcript contains strong language]

Transcript

Haas: My name is Hugo Haas, I am a fellow engineer at WeWork. I work on developer platform which is all the infrastructure and tooling that WeWork software is built upon. In my career I've worked on a lot of large distributed systems at different companies before joining WeWork. What I wanted to tell you about today is why technology is important to WeWork, and I'm realizing that this may not be completely obvious, why we're going through a transformation, and as Randy hinted, talk to you about a couple of problems concretely that we have because of our journey, because of our business, and how we're tackling them.

Let's dive in, and I need to say just a few words about WeWork to level set. A lot of people have heard about WeWork. There's a good chance that when you think about WeWork, you think shared office space, startups, freelancers. It turns out that - and it's important background for what I'm going to be saying next - WeWork is a lot more than that. First, we are a global platform for organizations of all sizes. We do have lots of customers who are freelancers and small businesses. We also have large enterprises as our customers, and we don't only offer space to those customers, to our members; we also build a community, put them all in touch, and offer services to them. That's starting to hint at why we would need some technology to do this.

A couple of numbers that I also wanted to start this talk with are about our scale today. We have 466,000 members globally and 485 physical locations, and those locations are in 105 cities in 28 different countries. It's a fairly broad footprint in multiple countries. Finally, in engineering, we are now roughly 1,000 people in technology in four different hubs globally. This has grown quite a bit over the past few years.

Why a Technology Transformation?

With all that said, let me start talking about why a technology transformation. The first thing to understand about technology at WeWork is that we have a very broad set of problems that we solve for. I'm going to list a few things, real estate management, efficiently sourcing where we're going to be placing our new locations, how to price them.

I'll give you an example: there are right now 60 WeWork locations in New York City. Should we open a 61st? Yes? No? If so, where? How should it be priced so that obviously the company makes money, but at the same time, we attract customers, and when we open the doors, we already have a fair amount of bookings, and the offices are not empty? That requires gathering and crunching quite a bit of information. Then there's obviously technology to help with sales and billing. I talked about community management. That was the second thing that was mentioned, in tools for the community.

One of the things that we do is that we learn about our members and we help them get in touch with one another. For example, you may be in a WeWork office, need help with some logo design, and it may turn out that down the hall or in another office two blocks down, there's somebody who does freelance design work, and so we can help with those connections, and our members really like that type of stuff.

As I talk about doing business also with enterprises, we also offer insights about how employees use our space and be able to surface what works, what doesn't work, some inefficiencies. There are actually quite a lot of things going on when you think about what type of technology does WeWork have.

The last aspect of our journey, and why I'm going to be talking about this transformation, is our growth through the years. WeWork has been in existence for about nine years. If you look at those numbers, the locations number and the memberships number, every year those numbers have doubled. What that means is that whatever solution you were thinking of using to figure out where you wanted to put new locations two years ago, now the solution from two years ago may be a little bit not working as well, we'll say. Some of the things that we used to do initially very manually with spreadsheets, we really have needed to move away from.

Here I feel I need to pause a little bit. I talked about coming up from very few locations, spreadsheets, doing things manually. Who here has been in a company which started with a business and then they realized, "Oh, shit, we need to do more technology and we need to become a technology company"? A fair amount of people have done this, and yes, this is the situation in which WeWork ended up. I like to think of WeWork not that long ago as the Wild West. Each of those domains that I described evolved pretty independently, worked a little bit like their own startups, made their own choices, and we ended up with very different decisions; I would say, a balkanized set of technologies.

The other piece of this puzzle, in terms of technology organization versus not, is that initially, technology wasn't a focus for the company. We were providing office space, and yes, we needed to help with room booking. We needed to put people in touch, but the primary thing here was, "Let's find an office and let's open a new WeWork office." What that means is that operational excellence, engineering excellence, was not a primary concern where we started. If you now look at what I'm talking about in terms of the scale of the real estate sourcing and workplace insights, these are some real engineering problems and we really have needed to step up our game.

In order to tell you more about how we've done this, I'm going to be using an overly simplified initial stack, and I'm going to tell you essentially three stories about how we're evolving things. When we started, we basically ran in a public cloud, U.S.-based; some stuff was running straight on AWS, other applications on top of Heroku. Each was a bit of a silo, each app talking to its own database, and they were doing their own thing. I mentioned each of those made their own decisions, and I used here the example of observability: how they were doing logging and metrics collection. They all chose different solutions.

One of the problems that came out of this, in addition to pure vendor management, is that if you are trying to figure out what's going on across multiple systems, you need to figure out where everything lands and try to join it all together. That was an interesting problem. The second one is, as we realized, "We cannot do everything by hand and it's time to gather all of our data," the company naturally evolved to using a warehouse and taking all this data and putting it into the warehouse to do analysis on it, but the data hadn't been thought out well enough to join nicely, and extracting insights from it was not straightforward.

The three things I wanted to cover today - I'm going to tell you about the role of developer platform, the group that I work on, about how we are consolidating and unifying some of the solutions that we're using, and how we are measuring the progress we're making. I'm going to be talking to you about how we are dealing with all of our data and moving from thinking about data after the fact to thinking KPI first. The last thing that I'll talk about is how we're moving away from a pure U.S.-based cloud footprint.

Developer Platform as a Transformation

Let's dive in with developer platform. I talked about all those domains, and very similar to Julia Grace this morning who said that when she joined Slack, there was no infrastructure group, there was no developer platform group. Developer platform is under all of this, and very similarly, about two years ago, this is about when the developer platform group showed up. We had to go and look at all those different solutions and how we wanted to consolidate all of those. The mission that we decided to set for ourselves, the vision, was and still is, enable WeWork to build better product faster. I'm going to be talking a little bit more about what that means and how we're looking at this.

The way we thought about doing this was, "We have a lot of different tools and it's very hard for an engineer to reason about all of this, even know everything that exists, so we're going to be providing tooling and infrastructure to enable our engineers to refocus on their domain problems." I talked also about our lack of engineering excellence culture or operations excellence culture, and we thought, "What we're going to do is that as we choose those solutions, we are going to either enforce some behaviors, or put good defaults so that this changes." One example is that we're working on allowing to define service level objectives the same way that you define your service. As you deploy something on our platform, you then have some alerting based on some goals that you have. All of a sudden, the developers are much closer to production and what's happening in the real world.
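To make the idea of declaring a service level objective next to the service concrete, here is a minimal sketch. It is not WeWork's platform code: the `ServiceLevelObjective` class, the error-budget formula, and the 25% alert threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    """Hypothetical SLO declared alongside a service definition."""
    service: str
    target_availability: float  # e.g. 0.99 means 99% of requests should succeed


def error_budget_remaining(slo: ServiceLevelObjective,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget left; negative once it is overspent."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = (1.0 - slo.target_availability) * total_requests
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def should_alert(slo: ServiceLevelObjective,
                 total_requests: int,
                 failed_requests: int,
                 threshold: float = 0.25) -> bool:
    """Alert when less than `threshold` of the error budget remains (assumed policy)."""
    return error_budget_remaining(slo, total_requests, failed_requests) < threshold
```

With a 99% target and 10,000 requests, roughly 100 failures are allowed in the window, so 50 failures leaves about half of the budget and 90 failures would trigger the assumed alert.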

In terms of types of solutions that we offer, we cover everything from developer experience, testing, CI, infrastructure, and data platform to observability. Some of the choices that we've made I wanted to highlight a little bit here, and I'm going to be talking about a couple of those pieces in more detail. We have gone down the GitOps road and everything, whether obviously, [inaudible 00:13:45] code changes, but also deployment configuration changes, happens through a git commit and CI/CD, building and sending the artifacts to our runtime platform. I talked about the infrastructure, so compute, storage, and our data platform on top. I'm going to be talking in more detail about compute and data platform in the next couple of sections.

Then, providing out-of-the-box observability so that our engineers are part of the loop, they really own what is running in production, the DevOps model. That's the pattern that we've been using to build our platform and continue down our journey.

I wanted to go back to one thing that I said earlier, which was this goal that we set for ourselves, enabling WeWork to build better product faster. We talked quite a bit about what does that mean and how to measure it. We thought about it in terms of iterations, especially a lean startup model and measuring. What we ended up doing actually, is leveraging some research that was published fairly recently in the Accelerate book about building and scaling high-performance technology organizations. I'm not going to go into a lot of details about what this book says, and I'll just jump straight to the conclusions from the book. I do encourage you to read the book. It's very interesting and eye-opening.

If you think about developer platform, I said that we were providing tooling from developer experience to runtime, to observability and monitoring. Basically, we are providing tooling for the entire development life cycle, and so that gives us the opportunity to instrument all of our pieces and get some insights for each of the projects and each of the applications that run on top of the developer platform. This is what we are going down the path for, extracting engineering excellence metrics out of the developer platform for each of those applications and reporting them for the team. I'll jump straight to the conclusion of the book, spoiler alert, but I still recommend reading the book.

There are four metrics that have been discovered that measure how well you do as a high-performing technology organization. Lead time for change, meaning how long it takes between the time you commit code until it's running in production. You can just think about this as, "If there's a lot of manual QA, this is going to be long." Deployment frequency: how often you deploy software. Here you can think of it as, if you deploy less frequently, either you just do fewer changes or you batch them all up more, and that probably brings some bigger risks of breakages.

Time to restore service: if something bad happens, how long will it take for you to recover from it? Here it ties into things like monitoring, and then goes back to how fast you can actually diagnose the problem and launch a fixed version. Change failure rate, which is how often you released something and it actually broke.

These were the four metrics that appear to be really good indicators of a high performing technology organization. We are instrumenting our developer platform to be able to track all this, and obviously, developer platform helps with some of those. How teams also behave and perform is another factor, but the goal is to improve on all of those four.
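The four metrics can be computed from fairly simple deployment records. The sketch below assumes a hypothetical `Deployment` record that already carries the commit time, the deploy time, and whether the deploy caused a failure; the field names and the use of medians are illustrative choices, not something stated in the talk.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_key_metrics(deployments: Sequence[Deployment], window_days: float) -> dict:
    """Compute the four metrics over deployments observed in a window of `window_days`."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.caused_failure]
    restore_hours = [_hours(d.deployed_at, d.restored_at)
                     for d in failures if d.restored_at is not None]
    return {
        # Lead time for change: commit until running in production.
        "median_lead_time_hours": median(
            _hours(d.committed_at, d.deployed_at) for d in deployments),
        # Deployment frequency.
        "deployments_per_day": len(deployments) / window_days,
        # Time to restore service, over failures that have been recovered.
        "median_time_to_restore_hours": median(restore_hours) if restore_hours else None,
        # Change failure rate: share of deployments that broke something.
        "change_failure_rate": len(failures) / len(deployments),
    }
```

Medians are used here so that a single very slow change does not dominate the picture; an average or a percentile would be an equally reasonable choice.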

Data Ecosystem in a Fast-Paced Environment

I've been talking about developer platform as a whole and a journey that we're on. I said I would touch on data platform and compute as a couple of examples of choices that we're making. Let's go and start with the data platform piece. I call this data ecosystem in a fast-paced environment. If you think back to what I described earlier, the situation where we landed a few years back was everybody developed their own application. They had their little database, or big database depending on the service, and then, after the fact, we took all of this into the warehouse and did some processing there. There were some problems with this.

How do we fix this? We're working on building a data platform so that we can think about KPIs first, what are the events that we care about, capturing those events, processing them, deriving insights from them, maybe near real-time, maybe after the fact. This is what the data platform does, and I'm going to go through some pieces at a fairly high level of the data platform that we are building and some of the choices that we're making. First step, capturing the events, and we have this event collector API, which is a fairly thin API on top of Kafka. The second building block is storing those events after you've collected them. There are two types of storage. One is in motion, and this is where we use Kafka. Then it's at rest, that can be either in a data lake or in a warehouse. I put an asterisk on things that we are still discussing here, we're looking at things like Delta Lake, Iceberg.
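A thin event collector of the kind described can be sketched as a validation-and-envelope step in front of a message bus. This is an illustration only: the required fields, the `events.` topic prefix, and the `publish` callable are assumptions, and the callable stands in for a real Kafka producer so that no Kafka client API is invented here.

```python
import json
import time
import uuid
from typing import Any, Callable, Mapping

# Stand-in for a message-bus producer: (topic, payload bytes) -> None.
Publish = Callable[[str, bytes], None]

# Assumed minimal contract for an incoming event.
REQUIRED_FIELDS = ("event_type", "source", "payload")


class EventCollector:
    """Hypothetical thin collection API: validate, wrap in an envelope, publish."""

    def __init__(self, publish: Publish, topic_prefix: str = "events.") -> None:
        self._publish = publish
        self._topic_prefix = topic_prefix

    def collect(self, event: Mapping[str, Any]) -> str:
        missing = [name for name in REQUIRED_FIELDS if name not in event]
        if missing:
            raise ValueError(f"event is missing fields: {missing}")
        # The collector's own id and timestamp always win over caller-supplied ones.
        envelope = {**event, "event_id": str(uuid.uuid4()), "received_at": time.time()}
        topic = self._topic_prefix + str(event["event_type"])
        self._publish(topic, json.dumps(envelope).encode("utf-8"))
        return envelope["event_id"]
```

In a real deployment the `publish` argument would wrap whichever producer client the platform uses; keeping it a plain callable makes the collector easy to test without a broker.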

Now that we have the data somewhere, either in motion or stored, we need to process it with some compute. There are three types of compute here: stream processing, batch processing, and some SQL interface. Lots of discussions going on about which one we're going to do. Finally, consuming all of this either through exports or through some querying interface or visualization interface. On top of all this, we have a scheduler and we use Airflow so that we can trigger jobs, and trigger some other jobs in case there are some dependencies there. We built all of this - or we're in the process of building all of this, I should say - as a good foundation so that we can say, "Hey, these are the metrics that I want to keep track of and I want to have more visibility into everything that's going on, either in real-time or analyzing a whole lot more data than initially."
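The scheduling idea, running a job only after the jobs it depends on, can be sketched with the standard library alone. This is not Airflow code; the job names are made up, and `graphlib.TopologicalSorter` (Python 3.9+) is used only to show the dependency ordering a scheduler has to respect.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List


def run_in_dependency_order(jobs: Dict[str, Callable[[], None]],
                            dependencies: Dict[str, Iterable[str]]) -> List[str]:
    """Run every job after all of its upstream jobs; return the order used.

    `dependencies` maps a job name to the names of the jobs it depends on.
    """
    unknown = {up for ups in dependencies.values() for up in ups} - set(jobs)
    if unknown:
        raise KeyError(f"dependencies refer to unknown jobs: {sorted(unknown)}")
    graph = {name: set(dependencies.get(name, ())) for name in jobs}
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        jobs[name]()
    return order
```

A production scheduler adds retries, time-based triggers, and parallelism on top of this ordering, which is the part a tool like Airflow takes care of.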

This is all well and good, but back to where we started, which is every domain does its own thing and everything lands in the warehouse: this doesn't prevent you from getting into a similar problem where there's lots of data that flows through this, but you don't know exactly what it is, where it's coming from, or if it's going to disappear because it was something from a researcher. One thing that we are doing on top, which is different from your traditional pure data platform solution, is focusing on metadata management to keep track of all of this. I'm going to talk about how we do this.

First, I'm going to be talking about why. Hopefully, I talked about what can happen if you don't have this. Our motivation here is that we really want to move to a data-driven culture. To do that, we want to democratize the use of data. Part of democratizing the data is making it super easy for everybody to use, and that's the self-service aspect, and also decentralizing it so that you don't have the data platform folks as a bottleneck in the middle.

The second piece is building trust in data. I'm sure that this has happened to some folks. You build some processor on top of some data and then the data schema changes and you didn't own the data that you were working on top of, so everything breaks on your end, and all of a sudden it's, "Well, my stuff is completely broken." Those are the types of scenarios that we want to avoid. We want to be able to provide some guarantees around the quality of the data so that independently I can know that I'm working on this dataset and this is its schema, and I can build on top of the schema and we'll be good going forward. The last piece of the puzzle is that I talked about knowing where the data is coming from, and it's important to give context to the data. Who created this data? Where is this data coming from? Is it production quality or research? How fresh is it? How often does it get refreshed? There are lots of aspects that need to be surfaced here in order to provide all this.
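The schema-change scenario can be caught with a simple compatibility check before a new schema is published. The sketch below assumes schemas are plain mappings of field name to type name, and that removing a field or changing its type counts as breaking while adding a field does not; both are simplifying assumptions for illustration.

```python
from typing import Dict, List


def breaking_changes(old_schema: Dict[str, str], new_schema: Dict[str, str]) -> List[str]:
    """List the changes that would break a consumer built on `old_schema`."""
    problems = []
    for field_name, old_type in old_schema.items():
        if field_name not in new_schema:
            problems.append(f"removed field: {field_name}")
        elif new_schema[field_name] != old_type:
            problems.append(
                f"type changed: {field_name} {old_type} -> {new_schema[field_name]}")
    # Fields that exist only in new_schema are additions and are treated as safe.
    return problems
```

A check like this can run when a producer registers a new schema version, so that the owner of the dataset sees the breakage before downstream consumers do.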

In order to do this, we've built this metadata service which sits at the top, which is called Marquez, and there are both an API and a UI. The idea here is that you have stuff going on in your data platform, batch processing, stream processing, you're just collecting events. You're recording all of this, the metadata about, "What's the schema? Who are you?" Marquez takes all the records of all of these, and then with the UI as a human, you can see what's going on and sift through a directory and get some visibility about everything that is in your data platform. One thing which is interesting here is that going down this path, we're working on tagging of specific attributes in data sets so that it allows you to keep track of things like PII and GDPR compliance type of things.

From the way we're talking about it, you know that I haven't called out any specific technology here, we're building it in a very modular way. Whatever technology you're using, you can build a module that registers a data set or a job that is running, and Marquez will happily keep track of it. One thing that we've done differently from a lot of different projects, is that we've been doing this open source from day one. If you go to github.com/MarquezProject, you will find the core metadata service with the UI and the API. You will find Java and Python clients. I mentioned that we use Airflow, you will find a module, a library, that allows you to register metadata about all the Airflow DAGs that you have.
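To show the kind of bookkeeping a metadata service does, here is a toy in-memory registry. It is not the Marquez API or data model; the class and method names are hypothetical, and the sketch only covers registering datasets and jobs, finding tagged fields (for example PII), and walking back to where a dataset comes from.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class DatasetRecord:
    name: str
    owner: str
    schema: Dict[str, str]                                        # field name -> type
    field_tags: Dict[str, Set[str]] = field(default_factory=dict)  # field name -> tags


@dataclass
class JobRecord:
    name: str
    inputs: List[str]   # dataset names the job reads
    outputs: List[str]  # dataset names the job writes


class MetadataRegistry:
    """Toy registry; any batch job, stream job, or collector could report into it."""

    def __init__(self) -> None:
        self._datasets: Dict[str, DatasetRecord] = {}
        self._jobs: List[JobRecord] = []

    def register_dataset(self, record: DatasetRecord) -> None:
        self._datasets[record.name] = record

    def register_job(self, record: JobRecord) -> None:
        self._jobs.append(record)

    def fields_tagged(self, tag: str) -> List[Tuple[str, str]]:
        """Return (dataset, field) pairs carrying `tag`, e.g. 'PII'."""
        return sorted((ds.name, fname)
                      for ds in self._datasets.values()
                      for fname, tags in ds.field_tags.items() if tag in tags)

    def upstream_of(self, dataset_name: str) -> Set[str]:
        """Return every dataset that directly or indirectly feeds `dataset_name`."""
        seen: Set[str] = set()
        frontier = [dataset_name]
        while frontier:
            current = frontier.pop()
            for job in self._jobs:
                if current in job.outputs:
                    for source in job.inputs:
                        if source not in seen:
                            seen.add(source)
                            frontier.append(source)
        return seen
```

Because the registry only receives names, schemas, and tags, it stays independent of the processing technology, which matches the modular approach described above.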

If you're interested, we've been participating in contributing to this, obviously. Stitch Fix has also been participating in it, and if you have an interest, I strongly encourage you to go and check it out and play with it. That's how, in addition to stepping up how we process and store data, we are taking care of not going back to being the Wild West and keeping it healthy and well-organized.

Infrastructure Needs for WeWork’s Footprint

The second thing that I wanted to touch on, which is even more specific to WeWork, are our needs around the infrastructure that are linked to our footprint. I want to go back to the slide where I showed where we have all of our locations. As you can see, we are in twenty-eight countries, we're in lots of different places. There are a couple of callouts that I want to make. First, we have offices in Australia, we've announced an office in South Africa. These are two places which are very far from a U.S.-based cloud footprint. I don't know if anybody here has dealt with serving media to Australia and dealt with the latency and bandwidth issues. There are some significant problems around serving places far away like these.

Our second callout is China. We are currently in China, and I'm sure that there are people in the room who have dealt with looking at bringing their service into China, and the tech stack and the vendors there are different. When we think about providing our services, we think about providing our services globally the same to everybody in the world, which means all those little dots need to have a similar experience with the same type of services. This is not completely novel. I'm sure there are lots of companies here who have global concerns. Why would you want to go from just a U.S.-based footprint to a more global, international one? Because of availability concerns.

Sometimes it's the quality of fiber connections between continents, and maybe flaky performance; sometimes you have data residency problems, where the data needs to live in a particular country. I started by saying that we started with a U.S.-based footprint. All of this forces us to go into a global footprint. What that means is that as we look into all of those different geos, using a single cloud vendor becomes a problem because there are places where you cannot use the same vendor as another place, and China is a prime example of that.

How are we approaching this problem? I talked earlier about developer platform providing this compute using containerization and providing this compute box. We created this thing called WeK8s, and I'm sure that if you have a Kubernetes project, you may have a similar naming convention of some kind. I've seen it in a couple of companies. WeK8s is our managed Kubernetes offering, and what we're providing with it is Kubernetes in order to schedule containers, Helm in order to define packages and what you're going to be deploying, and a number of additional services on top; currently secrets management with Vault, service mesh with Istio, and for observability we use Grafana and Prometheus. With those building blocks running on WeK8s, you can start running most of our applications. I am not covering the storage aspects here, but focusing only on the compute piece.

Our engineers deploy their software, and we're currently using Argo CD to deploy the applications on top of WeK8s. As far as they're concerned, they are deploying on top of Kubernetes using Helm. What's under WeK8s? WeK8s and Kubernetes are just abstracting whatever is under them, and if a cloud provider has a managed Kubernetes offering, we can essentially use this, whether it's AWS, GCP, [inaudible 00:33:31] cloud, run WeK8s on top and provide configuration in order to make all the rest work well. The engineers don't need to know about the bottom layer. Another benefit is that on your laptop, with the Docker Desktop application, you can also bring up a Kubernetes cluster, so we can also run WeK8s on developers' desktops. That's our approach for multi-cloud.

Randy hinted at multi-cloud and hybrid. Let me talk about a second thing, which is even more specific to WeWork. WeWork is not just a cloud company, we have offices. These are hard things that you can touch, and we have problems to solve inside those offices. I want to use one business problem specifically to illustrate the type of concerns that we have. I'm going to be talking about space access. A couple of numbers I'm going to repeat: 466,000 members and 485 locations. The bottom picture shows a WeWork card, this black card, and that allows you to get into a WeWork office. Some of our members have access to one building, other members have global access and have access to all of the buildings. The bottom line is that that's a lot of cards to have recognized with a whole lot of badge readers around the world, and this is stretching the state of the art in terms of scale.

The second example that I wanted to give is, Meetup is a WeWork company. Meetups are typically happening after hours, and it won't be a big surprise if I said that there are a fair amount of meetups that happen in WeWork offices. The interesting thing here is that it's after-hours and typically community managers are gone after 5:00, so an interesting question is, "How do you get into the building?" You can always have the organizer hold the door, etc., and let people in, but wouldn't it be nice if in your application you would say, "Hey, I'm going to attend this meetup," the organizer says, "Yes. You are in." Then you can just use this app and for the three hours that the meetup is scheduled for, you can just get into the building, get into the room, and after that you're locked out. That type of dynamic access again is stretching the state of the art, especially if you think about the scale at which we need to do it.
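The dynamic access described here boils down to a grant that is valid for one person, one building, and one time window. A minimal sketch, assuming a hypothetical `AccessGrant` record issued when the organizer confirms attendance; none of these names come from WeWork's systems.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class AccessGrant:
    """Hypothetical time-bound grant for a single building."""
    member_id: str
    building_id: str
    starts_at: datetime
    ends_at: datetime  # exclusive: access stops at this instant


def may_enter(grants: Iterable[AccessGrant],
              member_id: str,
              building_id: str,
              now: datetime) -> bool:
    """True if any grant covers this member, this building, and this moment."""
    return any(
        g.member_id == member_id
        and g.building_id == building_id
        and g.starts_at <= now < g.ends_at
        for g in grants
    )
```

For a meetup scheduled from 18:00 to 21:00, an attendee's grant would admit them at 19:30 and refuse them at 21:00 or at any other building.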

These are the types of problems that we need to solve, and you could think about doing this with a cloud-only solution. One of the problems here is that we've all been in the office and lost the internet. It tends to suck, you can't do email, you can't do Slack. These days, everything is in the cloud, so when you lose the internet, things typically grind to a halt. Imagine that all of your badge readers are being driven from the internet, and all of a sudden you cannot get in and out of your office. We're trying to avoid that type of problem. This is what is leading us towards bringing some logic, compute, and storage into buildings. This is an example of a growing number of use cases that we have, and a little bit similarly to why we're looking at a global footprint, we may need to move some of this processing and storage onsite because of availability concerns, latency concerns, bandwidth concerns.
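One way to keep doors working through an internet outage is for the in-building controller to hold a local copy of the access list and refresh it whenever the cloud is reachable. The sketch below assumes a hypothetical `fetch_allowed_cards` callable that raises `ConnectionError` when the building is offline; it illustrates the fallback pattern and is not a description of WeWork's access system.

```python
from typing import Callable, FrozenSet, Iterable


class BuildingAccessController:
    """Hypothetical on-site controller that degrades to cached data when offline."""

    def __init__(self, fetch_allowed_cards: Callable[[], Iterable[str]]) -> None:
        self._fetch = fetch_allowed_cards
        self._cached: FrozenSet[str] = frozenset()

    def refresh(self) -> bool:
        """Try to pull the latest access list; keep the old one on failure."""
        try:
            self._cached = frozenset(self._fetch())
            return True
        except ConnectionError:
            return False

    def may_open(self, card_id: str) -> bool:
        # Best-effort refresh, then decide locally so the door never waits on the cloud.
        self.refresh()
        return card_id in self._cached
```

The trade-off is staleness: while the building is offline, a card revoked in the cloud keeps working until connectivity returns, so a real system would likely bound how old the cached list may be.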

We do have a few challenges here. Number one, these are offices, which means these are not data centers. We have IT closets, and it's not like tons of racks and great cooling. There are space and cooling issues that we need to consider. The second problem that we may be facing is that there is no onsite technician. In our buildings we have community managers, and those community managers do a great job managing the office, being good hosts, and connecting people, but they are not going to log onto servers if something goes wrong, and we can't say, "Can you open up this unit and see if you could replace the hard drive?" We need to think about how we handle this problem.

The last piece of this puzzle is that if you think about the scale of this, you're going from a handful or a couple of handfuls of public cloud locations globally to one of those clusters in each of the buildings. All of a sudden, the scale of your deployment goes up one or two orders of magnitude. This is also bringing interesting challenges.

How are we doing this? The beauty of WeK8s and using Kubernetes as this abstraction is that we're also working on bringing a third type of substrate, which is an on-prem substrate that we can run on our hardware in IT closets. I mentioned all of our problems or potential challenges with the fact that we don't have technicians. As we are doing this, we're thinking very carefully about automating everything, such that you take one of those computers and you bring it into an office and it can get imaged from the cloud completely automatically with PXE booting, and if something goes wrong, we can just reimage it. The type of maintenance that would need to be done would probably be either we can recover, or we just swap. That's how we're thinking about our hybrid infrastructure and how our footprint is very different from that of a cloud-based company.

Takeaways

Conclusions, a few takeaways. We are seeing the developer platform as a cornerstone of our technology transformation. We are providing solutions to our engineers to simplify their lives, and this is also how we're measuring that we are making progress. With regards to managing our data and all the data that flows into the data platform, we are building a metadata service, Marquez, in an open-source fashion. The final piece of the puzzle that I presented today is that we are moving from a U.S.-based infrastructure to a global and hybrid infrastructure to run all of our applications. A lot of this is still early days, and these are interesting problems. If you get excited, we are hiring in all of our hubs, and I will be happy to stop here and hear the questions that you guys have.

Questions and Answers

Participant 1: You talked about your application infrastructure being Kubernetes based and the whole platform put together. My question would be, you're deploying it in different environments - do you use any of the higher-level cloud services, like inter-process communication, managed databases, and so forth? That's one question. The second is you’ve got a very large data environment that you're building out, and yet you have many clouds in different places. Do you bring all that data back to one place, or are they analyzed in a regional location?

Haas: These are great questions, I'm not going to have very good answers yet. One of the reasons is that we are in the process of going there. A couple of things that we have, we have multiple clouds today because we are in China and in the U.S., so we have at least two clouds. We do join some of this data. Your first question was around, how do we keep all the data in sync and managed databases.

Participant 1: Just leveraging the other cloud services. Inter-process communication, managed databases, notifications.

Haas: Right now a lot of this is work in progress. Applications tend to run right now scoped to a single cloud/location. In terms of alerting, we do bring all of this back to a single and central alerting and observability solution that provides us with a global view.

Participant 2: The engineering excellence metrics, what is their use? Who's responsible for understanding and reacting to them? Is it the platform team, the individual development teams, the whole organization?

Haas: I'm going to give two answers here. We are currently rolling them out. We haven't gotten through the process of how to act on them. Right now we're just trying to surface them because in a lot of places we were driving blind. To answer your question maybe slightly differently, these are quantitative signals that we have. In addition to those metrics, we're actually rolling out a qualitative assessment by each of the scrum teams, each with a framework - we call the level of framework - which allows them to reflect on, "How am I doing on development? How am I doing on testing? How am I doing on instrumentation?" whatever it is. Here, it helps them reflect, and we also have a broad view of how our organization is doing.

Participant 3: For those metrics, do you have any tools that you can share with us that helps you to get other data? For instance, the cycle time?

Haas: You're talking about the engineering excellence metrics. We are instrumenting a few things. Basically, the idea here is to be able to tie what's going on in GitHub with what's going down the CI pipeline, then landing on Kubernetes, which is really when something is up in production. We're taking this and instrumenting CircleCI and Kubernetes to get signals when something gets deployed, when something gets built, and keep track of those. That only covers part of the problem, because that gives you the deployment frequency. Based on the commit and what's in the repository, you can know how long the lead time was.
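Tying a commit to its eventual deployment can be sketched as a join on the commit SHA across event streams. The event shape below, a `(kind, sha, timestamp)` tuple, is an assumption made for illustration; real signals from a source host, a CI system, and a cluster would each need their own adapters.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, datetime]  # (kind, commit sha, timestamp); kind is "commit" or "deploy"


def lead_times_by_commit(events: Iterable[Event]) -> Dict[str, timedelta]:
    """Join commit and deploy events on the SHA and return commit-to-production time.

    Only the first commit and first deploy event seen for each SHA are used;
    commits that have not been deployed yet are left out.
    """
    committed: Dict[str, datetime] = {}
    deployed: Dict[str, datetime] = {}
    for kind, sha, timestamp in events:
        if kind == "commit":
            committed.setdefault(sha, timestamp)
        elif kind == "deploy":
            deployed.setdefault(sha, timestamp)
    return {sha: deployed[sha] - committed[sha] for sha in committed if sha in deployed}
```

Counting the deploy events per day from the same stream gives the deployment frequency, so one instrumented pipeline can feed two of the four metrics.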

That doesn't help you with the problem of change failure rate; for that part we are going to need to use Jira. We use Jira internally, and through some work Randy's doing around incident management, we are able to tie some of the deployments to incidents that we've had and how fast we're recovering. Some are easier and will yield a stronger signal, and others we need to figure out, and it's a little bit a matter of having the right processes in place so that we can get this data.

 

Recorded at:

Jul 29, 2019
