Scaling Uber to 1,000 Services

Summary

Matt Ranney talks about Uber’s growth and how they’ve embraced microservices. This has led to an explosion of new services, crossing over 1,000 production services in early March 2016. He “acknowledges the pain” in growing the architecture at an explosive growth company, and he describes how we might benefit from their pain points and learn from one of the fastest growing software companies.

Bio

Matt Ranney is Chief Systems Architect at Uber, where he's helping build and scale everything he can. Previously, Ranney was a founder and CTO of Voxer, probably the largest and busiest deployment of Node.js.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Key Takeaways

HTTP and JSON were designed for browsers; using RPC is better for computer-to-computer requests
After a certain age, microservices should become immutable
Having multiple languages allows for team preferences, but segregates developers based on language and prevents easy re-use of code across services
Monorepos allow for changes to be made across multiple services atomically, but prevent future open-sourcing and subset checkouts without specialised tools (e.g. FUSE drivers)
Performance problems are difficult to debug cross-language without standardised service dashboards and tools
Logging should never slow production down; in a failure storm, the logging system should drop rather than delay
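
The last takeaway (drop rather than delay) can be illustrated with a bounded buffer that never blocks the caller. This is a minimal sketch, not Uber's implementation: the class name `DroppingLogBuffer`, its methods, and the capacity of 3 are all hypothetical, chosen only to show the idea of discarding records when the logging pipeline cannot keep up.

```python
import queue
from typing import List


class DroppingLogBuffer:
    """Hypothetical bounded buffer: drops new records instead of blocking."""

    def __init__(self, capacity: int) -> None:
        self._queue = queue.Queue(maxsize=capacity)
        # Number of records discarded because the buffer was full.
        # (A plain counter; approximate if many threads log at once.)
        self.dropped = 0

    def log(self, message: str) -> bool:
        """Enqueue a record without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self) -> List[str]:
        """Remove and return everything buffered (what a shipper would send)."""
        records = []
        while True:
            try:
                records.append(self._queue.get_nowait())
            except queue.Empty:
                return records


# Simulate a failure storm: five errors arrive, but only three fit.
buffer = DroppingLogBuffer(capacity=3)
results = [buffer.log(f"error {i}") for i in range(5)]
```

The request path only ever pays for a non-blocking enqueue; the cost of a storm is lost log lines (counted in `dropped`) rather than added latency.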

Show notes

0:30 Rate of growth has been pretty incredible
0:40 Video showing traces of Uber cars in Beijing at start of 2015
0:45 Video showing traces of Uber cars roughly a year later (end of 2015) - significant growth. (Also shows how mapping car journeys can show the road layout comprehensively with enough data points.)
1:10 Video showing traces of Uber cars in Yangzhou at the start of 2015
1:12 Video showing traces of Uber cars in Yangzhou at the end of 2015, with significant growth
1:30 As of June 2016, Uber has 400+ cities worldwide, with a bit less than 7000 employees in 70 countries
1:40 Engineering grew from 200 engineers when Matt started to 2000 engineers in a year and a half
2:30 Tech talks tend to focus on what’s new and awesome, which is why there are lots of conferences talking about the same kinds of things
4:50 Exponential growth in number of services - started at 20 a couple of years ago (mid 2014)
5:00 Grows to 1200 services running in production at Uber (mid 2016)
5:30 Started with microservices, no significant legacy to work with
5:55 Get benefit of being able to release independently instead of releasing a single monolithic application
6:15 It’s better to have people come along and get running very quickly instead of breaking the whole thing
6:30 Individual teams can own their own uptime; they control the release process and so can deal with it
6:45 If you develop your own service, you own it - including being on pager duty when it goes down
7:00 Microservices let you use the best tool for the job - what might be best for you might not be best for your colleagues or your company
7:30 Some costs to doing everything as a bunch of services; instead of having a simple system, you now have a distributed system, and these are hard to understand and debug as well as causing the trickiest of outages
8:25 You trade developer velocity against operational complexity - you probably want your developers to move quickly, and you can probably get good at the operational complexity parts, so it is probably a reasonable trade-off. But it’s surprising how much operational complexity comes from breaking the system into little parts.
9:15 Do we ever turn services off? Yes and no. We turn off a couple every week but add a lot more.
9:35 As older services become more mature, it might be reasonable to never touch them, or to have an immutable or append-only microservice architecture.
10:00 The only time the system breaks is when you change it.
10:15 The time when Uber is most reliable is on the weekends because that is when the Uber engineers aren’t making changes.
10:35 After a certain age, your microservices should become immutable
10:50 Consider the cost tradeoff of keeping the services around versus the cost to remove them
11:10 Everything is a tradeoff, and that is extra true for large microservice deployments
12:10 It is tempting to build around problems instead of fixing the old software; that solves individual problems, but is perhaps not better for the business as a whole
13:30 You can keep your biases for your own preferred languages because it’s easier to spin up new services than migrate old ones
14:10 Uber started with a lot of NodeJS and Python, but has been moving more towards Go and Java, with the result that production microservices are written in four languages.
14:50 Hard to share code, such as fundamental shared services.
15:20 Hard to move between teams even if it makes sense to move around, since language limits where the developers can go
15:40 Having a polyglot environment results in artificial barriers being put up, with developers being labelled ‘Go people’ and ‘Java people’
16:05 These distinctions shouldn’t even be made, and they are a hidden cost.
16:45 Most services were built on top of HTTP
17:00 You realise that HTTP was designed for browsers, with a lot of flexibility in how it is interpreted (query parameters, paths, headers, methods, return codes); that is useful when talking to browsers but not so great for talking inside the data centre.
17:45 Started with JSON because it’s the obvious choice and is human readable, but it’s complicated when dealing with four different languages.
18:00 Without a way to express types or validate messages, JSON can be complex; it is also slow and adds serialization costs.
18:45 Servers aren’t browsers; using HTTP and JSON for servers adds complexity - typed RPCs are the way to go.
19:10 How many repositories should you have? One? Lots? People who have used a single repository extol its virtues, but open-source developers think individual repositories are the way to go. It’s a cultural decision.
20:00 With one monorepo, you can make a change that cuts across several services in a single go (or roll it back in one commit). If you have multiple repositories, that becomes a lot more difficult.
20:40 With one repository it’s not easy to open-source part of it, or to check out the repository in its entirety.
21:30 Count of Git repositories internally: 7k total in April 2016, 8.2k in May 2016 and 8.8k in June 2016, including around 400 configuration repositories.
21:50 Many services have two repositories; one for configuration and one for the service since they are deployed separately.
22:15 What do you do when things break? There’s a difference between one thing being broken and one thing in 1200 being broken.
22:25 Performance is dependent on language tools
22:40 Each individual thing seems pretty fast and it can be tempting to not consider performance, especially with multiple services in multiple languages. The same tool can’t easily be used across different systems or languages.
22:50 How do we view the entire system? What happens if the failure has occurred because of a failure in a dependent service?
24:45 I wish we had created a standard dashboard, generated automatically for every service so that they all look the same, as opposed to having teams individually decide what is important.
25:00 There should be a standard dashboard even if teams want to add additional content. Otherwise you end up with a custom dashboard and cross-service dashboard issues preventing developers from understanding issues with dependent services.
26:05 Performance doesn’t seem to matter at first: you start off having no performance problems, but eventually you will, and if you don’t have some way of introspecting the performance of the system it will be really hard to bolt on later.
26:30 You probably want some kind of baseline SLA to be included in the definition of a service not working; make sure every service has an SLA which is monitored.
27:00 If you have a bunch of services and you don’t know what their SLA is it’s really hard to add later
27:30 Due to fan out, the latency of a request is dominated by the latency of the slowest part of that fan out.
28:00 If you have a service that is pretty fast but 1% of the time it takes a second, then you have 1% of your users having the delay. But if you have 100 of them, that means that 63% of your users will be seeing slow cases somewhere in the path.
28:50 Tracing is important, and can be as simple as passing an identifier through the stack of the request, logging it, and stitching the request back together with some kind of log search, or using a tracing system such as Zipkin
29:35 Graph showing a tracing system exposing an issue with a slow service
30:20 Graph showing a trace of a nested fan-out of thousands of RPCs issued individually for a single request, instead of using batched requests.
31:40 Another example showing an ORM that made thousands of calls to the database
32:30 Tracing is important to understand the fan out of the system
33:10 Didn’t have any way to do cross-language context propagation or to do tracing between them
33:40 Multiple languages make consistent structured logging difficult, due to dependencies on different standard libraries
35:30 Logs can be parsed and structured by log systems like Splunk or ELK, and then presented for human consumption afterwards
36:25 Errors can generate a significant flood of log messages, and may even overwhelm your logging system
37:00 If you can’t keep up with the logging flood, then you should just drop logging data. It hurts, but doesn’t hurt as much as the logging system causing production problems.
37:35 It is expensive to analyse the logs, and it’s not clear how to pay for the logging infrastructure and for indexing the logs, or which team is going to pay for it.
38:20 If the teams that produce the logs are held accountable for the logging they generate, then you can control the costs associated with excessive logging
38:55 Need to load test against production, test with peak load against off-peak times.
39:25 Load tests in production mess with metrics for real production. Test requests need to be marked as tests so they don’t impact the business metrics and telemetry.
40:15 If we had a way to propagate an indication that a request is a test request, services could skip the metrics or logging for it
40:45 Failure testing is important, and adding it retroactively is a challenge.
42:45 Everyone is always migrating, and mandates are bad. There has to be a big win for everyone, using a carrot rather than a stick, or it will be really hard.
43:40 Buy or build? It’s a tricky trade-off and there is no right answer. Anything that is a platform is becoming commoditised and will be an Amazon offering.
44:40 All infrastructure software is becoming commoditized
45:00 Services allow people to play politics. Politics happens when a decision violates the ordering Company > Team > Self
46:10 Everything is a tradeoff; try to make them intentionally.
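
The fan-out figure in the 28:00 note can be checked with a short calculation. This is a sketch under one assumption that the note does not state: the 100 calls are slow independently of each other. The function name `fraction_hitting_slow_path` is hypothetical.

```python
def fraction_hitting_slow_path(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow.

    Each call is fast with probability (1 - p_slow), so all of them are
    fast with probability (1 - p_slow) ** fan_out.
    """
    return 1.0 - (1.0 - p_slow) ** fan_out


# One service that is slow 1% of the time.
single = fraction_hitting_slow_path(0.01, 1)

# A request that touches 100 such services.
fanned = fraction_hitting_slow_path(0.01, 100)

print(f"{single:.1%} of requests are slow with one call")
print(f"{fanned:.1%} of requests are slow with a fan-out of 100")
```

With these inputs the second figure is roughly 63%, matching the note: a tail that is rare for any one service becomes the common case once a request fans out widely.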

Companies mentioned

Uber
Amazon

People mentioned

Donald Knuth

Languages mentioned

Java
Go
NodeJS
Python

Products mentioned

Zipkin
Splunk
ELK stack (Elasticsearch, Logstash and Kibana)
Zap

References

“Premature optimisation is the root of all evil” - Donald Knuth in “Structured Programming with Go To Statements” Computing Surveys, Vol 6, No 4, December 1974