Uber's Chief Systems Architect on their Architecture and Rapid Growth
In this week's podcast, QCon chair Wesley Reisz talks to Matt Ranney, Chief Systems Architect at Uber, where he's helping build and scale everything he can. Previously, Matt was a founder and CTO of Voxer, probably the largest and busiest deployment of Node.js.
- Expanding a company and team at this rate is genuinely hard. Lots of mistakes have been made along the way.
- Microservices allow companies to grow rapidly but have a cost in terms of aggregate velocity.
- Uber is gradually moving its marketplace development from Node.js to Go and Java. Java is used for the map services.
- Aggressive failure testing is used extensively at Uber.
- Some early design choices - like using JSON over HTTP - make formal verification basically impossible.
QCon Talk and the cost of microservices
- 1m 42s - Having been to a lot of conferences, I started to reflect: "If I were going to pick a conference to attend, what would I want to get out of it?" A lot of architecture talks at QCon and other events left me feeling inadequate, as if other people (Google, for example) had it all figured out and I had not.
- 2m 54s - I want to question that because as I’ve talked to a lot of people I’ve realised that whilst they may have finally figured it out, a lot of their architecture is actually various shades of legacy shambles.
- 3m 21s - So let’s acknowledge this is really hard but also try to give you some ways to save yourself some pain.
- 3m 55s - Microservices have a lot of trade-offs, not all of which are obvious.
- 4m 41s - Adopting microservices allows you to write your software in different programming languages, so you could have some services written in Node.js, some in Python, some in Go and some in Java. Uber has all of these, plus some Scala.
- 5m 22s - Microservices bring lots of benefits, like teams owning their own release cycles and being responsible for their own uptime.
- 5m 46s - Because each team does its own thing, aggregate velocity is in many cases slower; the Java people had to figure out how to talk to the metrics system, as did the Node people and the Go people.
- 6m 05s - A hard-fought bug on one platform also has to be battled on another platform.
- 6m 19s - I hadn’t expected the cost of multiple languages to be as high as it was.
- 7m 21s - Uber is in many different data centres around the world, to get the data closer to customers and for availability reasons.
- 7m 43s - At the front end we do TLS termination and we are trying to push that as close to the end users as we can, so we’re using some cloud providers for that.
- 8m 03s - We then have the Node.js dispatch system, now called Marketplace because Uber is expanding into other sorts of logistics beyond transportation.
- 8m 31s - Marketplace is gradually moving from Node.js to Go and Java.
- 8m 50s - This is the part of the system that actually has to be up, so this is where we do aggressive failure testing and require the highest level of scrutiny about all the changes we make.
- 9m 14s - Marketplace uses Riak clusters to manage the state of all the jobs that are in progress.
- 9m 31s - Completed jobs are moved out of the Marketplace system and into various other business logic systems, a lot of which go through Kafka.
- 9m 58s - Various other queues drive follow-up workflows that, for example, prompt you to get a receipt and rate the trip.
- 10m 27s - The map services compute the ETAs and routes for the trip. These are some of the highest throughput systems and are mostly written in Java.
- 11m 07s - All the Kafka streams go to Hadoop for analytics processing.
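The job lifecycle described in the notes above (in-progress state held in one place, completed jobs handed off to downstream workflows) can be sketched roughly as follows. This is a toy illustration only: the dict and deque are in-memory stand-ins rather than Riak or Kafka clients, and every name here (`Marketplace`, `start_job`, `run_follow_ups`, the event fields) is hypothetical and not taken from Uber's code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Marketplace:
    """Toy stand-in: a dict plays the in-progress state store and a
    deque plays the stream of completed-job events."""
    in_progress: dict = field(default_factory=dict)
    completed_stream: deque = field(default_factory=deque)

    def start_job(self, job_id, rider, driver):
        # Only jobs that are still running live in the state store.
        self.in_progress[job_id] = {"rider": rider, "driver": driver}

    def complete_job(self, job_id, fare):
        # A completed job leaves the store and is published downstream.
        job = self.in_progress.pop(job_id)
        self.completed_stream.append({"job_id": job_id, "fare": fare, **job})


def run_follow_ups(stream):
    """Drain the stream and list the follow-up actions a consumer might take."""
    actions = []
    while stream:
        event = stream.popleft()
        actions.append(("send_receipt", event["job_id"], event["rider"]))
        actions.append(("request_rating", event["job_id"], event["rider"]))
    return actions


market = Marketplace()
market.start_job("job-1", rider="alice", driver="bob")
market.complete_job("job-1", fare=12.5)
follow_ups = run_follow_ups(market.completed_stream)
```

The point of the separation is that the part that has to stay up only holds live jobs, while receipts, ratings and analytics consume events at their own pace.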
Managing team growth and culture
- 12m 44s - Such a huge team and product growth rate would not be possible without being able to hire really good engineers. It would be much more efficient if we hired more slowly, but the competition is fierce and we have to work hard to stay ahead.
- 13m 51s - One of the interesting things is that adding people so quickly wouldn’t have been possible if we weren’t building things as lots of tiny services.
- 14m 44s - Uber's culture is not always cohesive.
- 16m 14s - Uber has a team that manages the in-house build and deploy pipeline. It uses Jenkins as well as automation developed in-house.
- 16m 44s - They also have an observability (metrics) team.
Flaws and crazy growth
- 17m 10s - There are four different languages and many different client libraries, so logging is hard for the Kafka team.
- 17m 58s - There are a lot of flaws but the business is incredibly successful, which fixes a lot of problems.
- 21m 38s - Uber has learnt from a lot of big companies like Google, Facebook, Twitter and Microsoft.
- 22m 10s - Ringpop, which most of the Marketplace systems are based on, is inspired by Microsoft Orleans and is available on GitHub.
- 22m 46s - Uber is currently working on interesting things around its RPC protocol, which will be open sourced.
- 23m 28s - Other systems share the same design, which supports aggressive failure testing in production without users noticing.
- 24m 36s - The failure testing is Uber’s version of Netflix’s Simian Army, but Uber runs a lot of its own data centres as well as using Amazon, so it has had to build a lot of its own tooling.
- 25m 28s - There must not be any nodes in the system that can’t break - anyone should be able to take any node down.
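One general technique that makes "any node can be taken down" practical is consistent hashing, which Ringpop's public documentation describes as the basis of its sharding. The sketch below is a generic illustration of that technique, not Ringpop's code or API; the class name `HashRing`, the replica count and the `trip-N` keys are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key):
    # Stable hash so placement does not change between runs.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, replicas=50):
        self.replicas = replicas
        self._ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        # Simulates taking a node down: all of its points leave the ring.
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def owner(self, key):
        if not self._ring:
            raise LookupError("no nodes left in the ring")
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"trip-{n}" for n in range(200)]
before = {k: ring.owner(k) for k in keys}

ring.remove("node-b")  # take one node down
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that `node-b` owned are reassigned; everything owned by the surviving nodes stays where it was, which is what keeps the loss of a single node from disturbing the rest of the system.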
Verification of distributed systems
- 26m 16s - Caitie McCaffrey from Twitter and Sean T. Allen from Sendence are talking about verification of distributed systems.
- 26m 51s - The main thing Uber does is have an integration test suite that tries to run traffic in a simulated environment.
- 27m 23s - It's hard to verify how the systems are supposed to work together if you don’t know what the contract is.
- 28m 09s - A lot of early code at Uber was using JSON over HTTP, which makes it hard to validate those interfaces.
- 28m 22s - Uber is moving towards type-safe interfaces between services; one of the biggest lessons was the unexpected cost of using type-unsafe JSON strings for exchanging data between services.
- 29m 28s - Getting interfaces which are type safe and verifiable is a major undertaking.
- 29m 33s - In parallel to this, Uber is doing black-box testing with an army of mobile phones around the world.
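To make the contrast between untyped JSON and a checked contract concrete, here is a minimal sketch. The message type `TripRequest` and its fields are invented for illustration, and the hand-rolled validation stands in for whatever interface definition tooling a real system would use; it is not a description of Uber's interfaces.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: these field names are invented for the example.
_FIELDS = {"rider_id": str, "pickup_lat": float, "pickup_lng": float}


@dataclass(frozen=True)
class TripRequest:
    rider_id: str
    pickup_lat: float
    pickup_lng: float

    @classmethod
    def from_json(cls, raw):
        """Parse a payload and reject anything that breaks the contract."""
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("payload must be a JSON object")
        missing = _FIELDS.keys() - data.keys()
        unknown = data.keys() - _FIELDS.keys()
        if missing or unknown:
            raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
        for name, expected in _FIELDS.items():
            value = data[name]
            # Numeric fields accept ints and floats, but never bools or strings.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            valid = is_number if expected is float else isinstance(value, expected)
            if not valid:
                raise ValueError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return cls(data["rider_id"], float(data["pickup_lat"]), float(data["pickup_lng"]))


def contract_error(raw):
    """Return the validation error message, or None if the payload is valid."""
    try:
        TripRequest.from_json(raw)
    except ValueError as exc:
        return str(exc)
    return None


good = '{"rider_id": "r-1", "pickup_lat": 37.77, "pickup_lng": -122.42}'
bad = '{"rider_id": "r-1", "pickup_lat": "37.77", "pickup_lng": -122.42}'

untyped = json.loads(bad)        # plain JSON parsing accepts the string latitude
problem = contract_error(bad)    # the typed boundary rejects it
```

With plain `json.loads`, the mistyped latitude travels on until some later service tries to do arithmetic with it; with a contract at the boundary, the mismatch is reported where the message enters.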
QCon is a practitioner-driven conference designed for technical team leads, architects, and project managers who influence software innovation in their teams. QCon takes place 7 times per year in London, New York, San Francisco, São Paulo, Beijing, Shanghai & Tokyo. QCon London is in its 11th edition and will take place Mar 6-10, 2017. 100+ expert practitioner speakers, 1300+ attendees and 18 tracks will cover topics driving the evolution of software development today. Visit qconlondon.com to get more details.
More about our podcasts
You can keep up-to-date with the podcasts via our RSS feed, and they are available via SoundCloud and iTunes. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.