Optimizing Efficiency & Capacity Management at Web Scale on the Cloud

Summary

Molly Junck shares insight on how Pinterest optimizes their use of the cloud, concurrently maintaining demands for security, availability, rate of innovation, and infrastructure efficiency.

Bio

Molly Junck is a Technical Program Manager, leading the Infrastructure Governance Program at Pinterest. Molly is responsible for supporting Pinterest’s capacity management, cloud infrastructure cost and usage data, as well as managing the relationship with Pinterest’s cloud provider. Prior to Pinterest, Molly worked at Adobe where she led the Security Testing Program.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Junck: Halloween was just a few days ago, so I'm going to start today's presentation with a scary story. Let's imagine you work for a company that operates on the public cloud, and runs one of the largest cloud workloads in the world, spanning hundreds of services. Your company's top priority is availability and user experience. Much further down that list is cloud efficiency. Part of your job is to keep those cloud costs under control. The company culture allows for any engineer to spin up as many machines as they want at any moment. That's the situation that my team and I find ourselves in at Pinterest every day, and sometimes it can be a daunting or a scary task. The purpose of my talk is going to be to share some of the strategies that we've used to successfully scale in the cloud that may hopefully be useful for your companies as well.

Background

My name is Molly Junck. I'm a Technical Program Manager of the Infrastructure Governance Program. I work here at Pinterest, where I've been for the last year and a half since April 2020. Before Pinterest, I worked at Adobe on security and release engineering projects. What I'll refer to in this talk as my team, are the other contributors to the Infrastructure Governance Program, which is a cross-functional group from finance, engineering, and data science and analytics. I may refer to infrastructure governance as infragov, but this is the same thing.

A little bit about Pinterest. First and foremost is our mission. Our mission is to bring everyone the inspiration to create a life they love. The company was founded in 2010, and we've now surpassed 2500 employees. We also operate one of the world's largest cloud workloads, which makes it a really exciting place to work. At our current scale, we're serving almost half a billion users every month across the globe. We have over 300 billion Pins saved. In the U.S. alone, we reach 41% of internet users every month. Our cloud provider is Amazon Web Services, and we're a cloud native company. You may hear me refer to AWS technologies such as EC2 instances for compute, or S3 for file storage. Many of the strategies that I'll be discussing apply regardless of what cloud provider your company uses.

Challenges of Scaling

One of the challenges that organizations face as they grow is a different class of risks when it comes to controlling cloud costs. Unlike smaller companies, which generally have smaller staff and can operate a little more nimbly, larger companies become like slow moving ships that require significant planning and coordination for any major change. Architecture complexity often increases as well with scale, making predictions a lot harder when you're spinning up hundreds of services versus maybe dozens of services. An unexpected challenge that companies sometimes also see is that the typical principle of infinite cloud scalability, that anybody can spin up as much as they want at any time on-demand, can break down when you're operating at scale. If we were to suddenly change the characteristics of our EC2 fleet in a major way without letting AWS know, it's possible they may not have that volume of capacity just sitting around waiting, and we may run into shortages. Larger companies may also have other factors to consider, such as moving from single region to multi-region, which poses its own challenges.

Outline

In my talk, I'd like to cover three key strategies that we've used for managing our cloud resources. First of all, the foundation will be providing good data. Second, providing guidelines and strategies for how those cloud resources are used. Third is not to micromanage your way to cloud efficiency, but to use incentives instead to achieve your organization's spend goals.

1. Provide Good Data to Make Good Decisions

The first key strategy that I'd like to go over is providing good data to employees so that they can make good decisions. At Pinterest, we aim to make data-driven decisions whenever possible, and data-driven decisions that take cloud costs into account. Cloud costs require a trustworthy, timely data source. Providing this data source has been one of the core focuses of the Infrastructure Governance Program.

Pinterest's Infra Cost and Usage Data

At Pinterest, we provide our employees with this data in a set of tables that we call the BI Infrastructure Governance Data Mart. This data allows our employees to answer questions such as how much a given service costs, or whether their service's use is in line with the targets for that service. This also powers internal dashboards sent to leadership about how much each organization is using or spending in terms of infrastructure. Two organizations within Pinterest have also embedded this data into experiment dashboards so that teams can factor in changes to infra costs when deciding whether to launch an experiment, for example, if it significantly increases or significantly decreases their infrastructure cost.
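To make the kind of question such a data source answers concrete, here is a minimal sketch of rolling usage records up into a per-service cost and comparing it with a target. It is not Pinterest's Data Mart: the record layout, service names, instance type labels, rates, and targets are all hypothetical.

```python
from collections import defaultdict

# Hypothetical usage records: (service, instance_type, instance_hours).
USAGE = [
    ("feed", "type-a", 1000),
    ("feed", "type-b", 200),
    ("search", "type-a", 400),
]

# Hypothetical hourly rates per instance type, in dollars.
RATES = {"type-a": 0.25, "type-b": 0.5}

# Hypothetical spend targets per service, in dollars.
TARGETS = {"feed": 300.0, "search": 150.0}


def cost_by_service(usage, rates):
    """Sum instance-hours times hourly rate for each service."""
    totals = defaultdict(float)
    for service, instance_type, hours in usage:
        totals[service] += hours * rates[instance_type]
    return dict(totals)


def services_over_target(costs, targets):
    """Return the services whose cost exceeds their target, sorted by name."""
    return sorted(s for s, cost in costs.items() if s in targets and cost > targets[s])


costs = cost_by_service(USAGE, RATES)
print(costs)
print(services_over_target(costs, TARGETS))
```

In practice the same rollup would more likely be a query against the cost tables, but the shape of the answer is the same: cost per owner, next to a target.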

The data source that we use comprises both internal and external sources. The major external source is the AWS Cost and Usage Report, or CUR, which is a detailed line-by-line billing record of all of the resources that we use. Internal sources include things like multi-tenant platform records, for example, from our internal Hadoop platform for offline data processing; our configuration management database, or CMDB, which contains all of our internal EC2 usage records; and organizational data such as LDAP, or service ownership information. We also use custom blended cloud resource rates, which factor in details like custom pricing, and the typical ratio of reserved instances versus on-demand that we achieve for that given instance type. At Pinterest, we found that constructing our own internal data source was the best way to provide visibility for employees. This is by no means the only way to accomplish this. For example, we've seen other companies address this via AWS tags or metadata from higher order services.
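One simple way to read "blended rate" is as a weighted average of the reserved and on-demand prices, weighted by how much of the usage is typically covered by reservations. The sketch below assumes exactly that; the actual formula Pinterest uses is not given in the talk, and the prices and coverage figure are made up.

```python
def blended_hourly_rate(on_demand_rate, reserved_rate, reserved_coverage):
    """Weighted average hourly rate for one instance type.

    reserved_coverage is the fraction (0 to 1) of usage that is typically
    covered by reserved instances; the remainder is billed on-demand.
    """
    if not 0.0 <= reserved_coverage <= 1.0:
        raise ValueError("reserved_coverage must be between 0 and 1")
    return reserved_coverage * reserved_rate + (1.0 - reserved_coverage) * on_demand_rate


# Hypothetical numbers: $0.40/hour on-demand, $0.25/hour reserved, 80% coverage.
rate = blended_hourly_rate(on_demand_rate=0.40, reserved_rate=0.25, reserved_coverage=0.8)
print(round(rate, 4))
```

Charging every team the same blended rate for an instance type means no team is penalized or rewarded for which physical instances happened to land on a reservation.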

2. Provide Guidelines and Strategies for How Cloud Resources Are Used

The second key strategy that I'd like to go over is to provide centralized guidelines for how those cloud resources are used within your company, so that you can simplify your company's architecture needs. At Pinterest, the Infrastructure Governance Program provides oversight for the entire cloud footprint. While we don't mandate that every team follow these guidelines strictly, encouraging teams to align on these standards has been really critical for us.

Define Your Capacity Management Strategy

The backbone of your company's strategy as it relates to compute especially needs to be a sound capacity management strategy. To help inform your compute capacity strategy, it's important that if you've got a central team managing this, that they're decently connected to the technical needs of the services that are most important to your business. The other critical angle here is that it's not a one-time exercise that you create a strategy and then move on, you need to be revisiting this on a periodic basis so that it is current based on the technical needs of your services at any given time. This requires good bidirectional feedback with service teams about where they're planning on going with their architecture and technical needs.

With detailed knowledge of the services' technical needs, simplify your architecture by standardizing. For example, here at Pinterest, we aim to simplify our compute fleet by suggesting that most workloads standardize on one of three to four top EC2 instance types out of the dozens that AWS provides. Even if an individual workload would be more efficient on some other instance type, centralizing on just a few that we use across our entire fleet gives our finance team the ability to make large purchases, because our total capacity numbers end up remaining a little steadier, even though the capacity needs of the workloads underneath them may significantly vary. This also gives us more flexibility from an availability standpoint, because we rely on large pools that can be transferable between services. This doesn't mean that our entire fleet runs only on these three to four instance types, but we strive for only specialized workloads to run on specialized instance types.
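The point about pooled totals staying steadier can be illustrated with a small arithmetic sketch. The monthly instance counts below are invented, and the effect depends on workloads not all peaking at the same time, which is an assumption of this example.

```python
def relative_spread(series):
    """(max - min) / mean: a rough measure of how much demand swings."""
    mean = sum(series) / len(series)
    return (max(series) - min(series)) / mean


# Hypothetical monthly instance counts for two workloads on the same instance type.
workload_a = [100, 140, 90, 130]
workload_b = [120, 85, 135, 95]

# Because both run on one standard instance type, their demand can be pooled.
pooled = [a + b for a, b in zip(workload_a, workload_b)]

for name, series in [("a", workload_a), ("b", workload_b), ("pooled", pooled)]:
    print(name, round(relative_spread(series), 3))
```

With these numbers each workload swings by more than 40% of its mean, while the pooled total swings by only about 2%, which is what makes a large up-front purchase easier to size.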

Furthermore, it's important to determine governance mechanisms to ensure that your company's growth is in line with your company goals. Choose a strategy that works with your company's culture. Here at Pinterest, we have a capacity request process, which allows for organizational leaders to approve requested capacity that meets certain thresholds to ensure they know what's changing within their organization, and that they have an opportunity to push back if that's not in line with the growth that they're expecting. Engineers can also bypass governance controls here at Pinterest if needed, because, for example, if there were an incident, they may need to respond and spin up instances. It is part of our company culture that we tend not to put in controls that prevent people from making changes as they see fit.
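A capacity request process with a threshold and an incident bypass could be sketched roughly as follows. The threshold value, field names, and the labels returned are hypothetical; the talk only says that requests above certain thresholds go to organizational leaders and that engineers can bypass controls when needed.

```python
from dataclasses import dataclass

# Hypothetical threshold: requests at or above this size need sign-off.
APPROVAL_THRESHOLD_INSTANCES = 50


@dataclass
class CapacityRequest:
    service: str
    instance_type: str
    additional_instances: int
    incident_response: bool = False


def review_path(request):
    """Decide how a capacity request is handled."""
    if request.incident_response:
        # Incidents are never blocked on approval.
        return "bypass"
    if request.additional_instances >= APPROVAL_THRESHOLD_INSTANCES:
        return "needs-org-leader-approval"
    return "auto-approved"


print(review_path(CapacityRequest("feed", "type-a", 200)))
print(review_path(CapacityRequest("feed", "type-a", 200, incident_response=True)))
```

The design choice worth noting is the order of the checks: the bypass comes first, so governance informs leaders without ever standing between an engineer and an incident.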

Beyond controls within your organization, strong bidirectional communication with your cloud service provider can be really important, especially when operating at scale. One of my favorite quotes about the cloud is from an engineer on my team that works on our cost and usage data, Elsa Birch, who likes to say, the cloud is like sweatpants, it's elastic until it's not. We've run into some challenging scaling problems that don't really apply when you're a smaller company. It's important that we treat AWS like a partner in our major architecture decisions, which both helps them plan for the capacity needs that we will have, and also allows us to predict challenges before we face them.

One major example of a capacity challenge that we had was back in 2019, so two years ago. At that point, we had successfully standardized most of our fleet on just a few instance types. The company decided to make a big push that year to upgrade from a slightly older generation to a slightly newer generation, which had better performance metrics across the board. We worked in close communication with our account team at Amazon, and migrated teams throughout the year so that the instance pools were ready for it. Going into the end of the year, we thought that we were in good shape with our instance pools; however, we hadn't fully anticipated the organic growth those services would see over the holiday period. At Pinterest, October through December is a really critical time for the company because we see a major traffic bump, when a lot of people come to Pinterest to find Halloween costumes or come up with recipe ideas for Thanksgiving, or to complete their holiday shopping.

We ended up seeing major spikes in especially the instance types that autoscale heavily, and we ran into shortages of this newer generation, which forced us to define fallback procedures to go to other instance types or older generations. As a result of this incident, we maintain fallback procedures today for any services which autoscale heavily, and we communicate any changes in capacity that we're expecting much more closely with Amazon to avoid future shortages.
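A fallback procedure of this kind can be sketched as an ordered list of acceptable instance types that is tried until one launches. This is an illustration only: `launch` is a hypothetical callable standing in for whatever provisioning system is in use, `InsufficientCapacity` is a made-up exception, and the instance type names are placeholders.

```python
class InsufficientCapacity(Exception):
    """Raised by the hypothetical launcher when a pool has no capacity."""


def launch_with_fallback(launch, instance_types, count):
    """Try each instance type in order; return the first one that launches."""
    for instance_type in instance_types:
        try:
            launch(instance_type, count)
            return instance_type
        except InsufficientCapacity:
            continue  # This pool is exhausted; try the next choice.
    raise InsufficientCapacity(f"no capacity for {count} instances in {instance_types}")


# Stand-in launcher for demonstration: pretends the newest generation is exhausted.
def fake_launch(instance_type, count):
    if instance_type == "newgen.large":
        raise InsufficientCapacity(instance_type)


chosen = launch_with_fallback(fake_launch, ["newgen.large", "oldgen.large"], count=10)
print(chosen)
```

Keeping the fallback list per service matters because an older generation is only a valid fallback if the service has actually been tested on it.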

3. Don't Micromanage, Incentivize Optimization and Efficiency

The third, and what I'd say is the most important key strategy is that managing cloud resources and spend is a journey, not a well-defined path. When our finance team sets targets for our infrastructure spend, that's based on the needs of the business and the industry, we don't have a clear map of exactly how we're going to achieve those spend goals. Pinterest's core function isn't optimizing our workloads. We need the vast majority of our engineers to be focused on new features that bring more inspiration to the world. However, building more efficient systems and reducing the costs of current workloads, creates the financial headroom for us to make big bets and try new things. At Pinterest, we've found that a small investment of staffing focused on efficiency, often 5% or less, can yield major returns.

Lean on Subject Matter Experts to Achieve Cloud Spend Goals

Today at Pinterest, we've built the organizational support for infrastructure goals through the nomination of what we call spend captains. Spend captains are representatives from each major organization within the company that are responsible for serving as infragov's point of contact for cloud spend for their organization. Spend captains are responsible for partnering with infragov to define organizational level budgets, are responsible for keeping their organization's budgets in line with their targets, and promoting optimization opportunities to create that headroom for their organization to grow. They're often relatively senior folks within their organization so that they can have the appropriate level of influence on their roadmap. We would not be able to achieve our spend goals without the partnership from spend captains. They're really critical to our success. We also rely on engineering teams to crowdsource ideas for efficiency projects. While the infragov group aims to have a baseline understanding of our cloud needs, we're not subject matter experts for each of our services, or each of the AWS technologies, and we rely on engineers to share what efficiency opportunities may exist within their domain. Then we rely on the spend captains to help understand the ROI of each of those, and to support building in the engineering time to make that efficiency work happen.

Then, finally, we lean on organization leaders to celebrate and incentivize efficiency projects. For example, one way at Pinterest that we incentivize efficiency work is something that our head of engineering, Jeremy King, often does. We often have AMA sessions with engineering leadership, and he often kicks those off highlighting major efficiency wins from teams that have saved some number of millions of dollars by focusing on a change to our data transfer or something like that. Creating visibility for this work demonstrates to other engineers in the company that this is really important to our success, and helps to incentivize future engineers to prioritize efficiency projects.

Recap

To recap the three strategies that we covered: at Pinterest, we anchor on providing trustworthy cost and usage data, so that engineers can make good decisions; providing centralized guidelines to simplify our architecture; and leaning on subject matter experts, so that we're not micromanaging how we achieve our cloud spend goals.

Questions and Answers

Watson: What optimizations have been most effective at Pinterest, in your mind? I know they're all over the place but what do you think have been the most efficient optimizations?

Junck: As an organization, you need to balance optimizations by impact and complexity and difficulty, and ideally try and prioritize the ones that are highest impact and lowest complexity. You need to attack them from a multi-pronged angle, with a multifaceted strategy. From our side, here are some of the efficiency projects that have had the most impact. One example is an AWS technology called S3 Intelligent-Tiering, which is just a setting that you can turn on to have AWS automatically change the storage tier. That's been very high impact and low effort. It's something that is like turning a dial, and then suddenly you see a lot of savings come in. We also have financial efficiencies, something that our finance team adjusts the levers of, like reserved instances or savings plans, so that we can make some commitments about the capacity that we are sure is going to stay around, at a reduced price.

We also have a group that tracks regressions. For example, with our data transfer savings, we've found that we need to keep a close eye on them and check in on about a monthly basis to make sure that we don't have any projects that have suddenly changed a setting and have unexpected cost. Then there are also efficiency projects that are a little harder and require more complexity, for example, rightsizing instances, where you have to really go in and understand the service, potentially perform load tests, and make sure that you've got the right instance type for the workload that's running on it.
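For the S3 Intelligent-Tiering setting mentioned in this answer, one common way to turn it on for existing data is a bucket lifecycle rule that transitions objects to the `INTELLIGENT_TIERING` storage class. The sketch below assumes that approach; the talk does not say how Pinterest enabled it, and the rule ID, prefix, and bucket name are hypothetical. The rule builder runs on its own, while `apply_rule` needs `boto3` and AWS credentials.

```python
def intelligent_tiering_rule(prefix=""):
    """Build a lifecycle rule that moves objects under `prefix` to Intelligent-Tiering."""
    return {
        "ID": "move-to-intelligent-tiering",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Days=0 transitions objects as soon as the rule is evaluated.
        "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
    }


def apply_rule(bucket_name, prefix=""):
    """Apply the rule with boto3. Requires boto3 and AWS credentials.

    Note: this call replaces the bucket's existing lifecycle configuration.
    """
    import boto3  # imported lazily so the rule builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={"Rules": [intelligent_tiering_rule(prefix)]},
    )


rule = intelligent_tiering_rule("logs/")
print(rule)
```

Because the call overwrites any lifecycle rules already on the bucket, a real rollout would read the current configuration first and append to it.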

Watson: Does Pinterest have autoscalers, and what are the key metrics on how resources get scaled up or down depending on traffic?

Junck: We definitely do have autoscaling. There's no single autoscaling policy across Pinterest. There are hundreds of services, and some of them that respond to traffic autoscale heavily in response to demand, whereas others that are, for example, stateful systems tend not to autoscale. There's no single answer about how we autoscale as a company.

Watson: Have you considered including carbon cost alongside financial cost so we can be carbon efficient?

Junck: That's something that we are striving to do. Our carbon footprint is something that Pinterest cares about, but we're not there yet.

Watson: How do you manage scenarios where developers forget to tag EC2 instances properly, resulting in an infrastructure estate you don't have insights into, in terms of what actually happened?

Junck: One of the interesting things about what we do at Pinterest is, I know a lot of companies use EC2 tagging to track their infrastructure. We actually don't have a unified tagging strategy across the company, so we use our own ownership tool that we internally call Nimbus. Using the CMDB together with Nimbus, we map who owns what. We still run into the same problem that you're talking about, with some instances not getting tagged appropriately. Right now we're more reactive than proactive, but because we do have this really high quality cost and usage data, hopefully somebody notices when something changes, and suddenly they're charged a lot more. That's that distributed accountability that helps us keep on track with our spend.

Watson: Why do you focus on EC2 so much? Are no managed services used, and if they are used, are they shared by more than one team, and how is this calculated?

Junck: Pinterest as a company is about 10 or 11 years old. That was pretty close to the beginning of AWS. I think AWS started in 2008 or 2009, or something like that, just when companies were starting to get on the public cloud. We're one of the few companies that is born on the cloud, cloud native. We ended up just building our own PaaS, and use a lot of just the raw AWS resources like EC2 and S3. We've got our own monitoring solutions, our own logging solutions, and less of those managed services.

Watson: Then, EKS's use, how do you track spend there?

Junck: I think this does get to a different point about like, what about all these other services? I'm talking a lot about EC2 and S3, and we have this very granular solution for all of our EC2 spend, but we do have some other minor services that we use that aren't a big chunk of our spend. That's something right now that just finance and the Infrastructure Governance Program keep an eye on, and then react if they get to that size, where it would actually become problematic or something that we should be tracking all the more. We don't actually use EKS, we run our own Kubernetes.

Watson: We do have some services that are large enough we do track them, like I think hosted Elasticsearch or something like that.

Junck: We track it but we don't have the granular cost attribution yet.

Watson: Do you manage web application versions when deploying them on the cloud?

Junck: That's probably out of the scope of my team, or what I can answer. We've got a team that focuses on web apps.

Watson: When we're looking at attribution cost, do you tag instance usage by the version of the app running on that instance in, say, the Data Mart?

Junck: I don't think we do. The things that we're more focused on are like ownership and then service. That data allows us to do a whole bunch of things with it. Knowing who owns what has benefits, not just for cost and usage where my domain cares a lot, but also security and instance upgrades, and things like that.

Watson: How do you control capacity inside Kubernetes? Is it only about tracking EC2 consumption?

Junck: Where we are in our journey with cost attribution is, we built this Data Mart about a year and change ago, and so far our primary focus has been on EC2. This is something where there's a lot of benefits from having this granular data, and we're wanting to break it down more. We have a group of engineers that are working on expanding that to also have cost attribution for S3, and Kubernetes is pretty high on that list as well. Where we are today is that we have detailed attribution for Kubernetes as a whole, which is a big chunk of change, so we then have to use other controls. For example, when our Kubernetes team needs to grow and they know what use case is causing that growth, we evaluate that on a case by case basis, and make sure that we have the funds to do that. It's something that we want to go towards, having that same cost attribution and breakdown across all major platforms.

Watson: Ingress and egress costs make a significant portion of cloud costs, if we're running services which involve lots of data transfer, how do you track the network usage at Pinterest?

Junck: That's similar. Data transfer cost is actually something that we want to build detailed cost attribution for. Right now, where it is in our spend trajectory, we've got that group that meets monthly to evaluate our data transfer cost. That's our strategy in general. If we don't have the ability to have detailed granular cost attribution, we need some other controls in place to make sure that spend doesn't go out of control, and so far, that's worked well for us.

Watson: Maybe like data transfer, some of these are very hard to decompose, because they're shared by tenants. Imagine a datastore where many people are looking at the same tables, and that team wants to charge back the cost to their users. That's obviously challenging. What are your guidelines for where you draw the cut line? Is there some size maybe at which something needs to be before you invest in multi-tenant cost attribution, where we just have to hold the line of like, we're going to do this, but not do that? I'm sure everybody wants to charge back everything, even if their service is like this big.

Junck: If I could have a team of 30 engineers, we'd attribute all of the things and be able to have very granular data. There's two things. There's some things that the services can do to get ready for this. Because, thankfully, Pinterest has been on a growth trajectory, and so the size of each service continues to grow. The more services can have good logging data about who uses what, the more that helps them attribute cost and track their growth before they actually have the cost attribution. So the first thing is just to focus on who's using what.

Then, from my team that's working on expanding the cost attribution to other systems, we're looking at the ROI. We would ideally like to get a lot more multi-tenant platform systems in there, but we have to evaluate which systems would benefit the most from cost attribution. Things that we think about when evaluating systems are like, how many business drivers are associated with the users on the platform? If it's just user growth, then hopefully their system is relatively predictable, but if it's five different factors, they're going to have a really hard time predicting how much they're going to need in the future and getting an accurate budget estimation. Other factors that we think about are how many users are on the platform. If they're dealing with 5 different tenants versus 500, they're going to have different levels of challenge of managing within their spend.
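The "who's using what" logging data mentioned in this answer is what makes a simple proportional split possible. As a minimal sketch, assuming a shared platform's cost is divided by each tenant's share of some logged usage measure (the tenant names, usage unit, and dollar amount are all hypothetical):

```python
def attribute_cost(total_cost, usage_by_tenant):
    """Split a shared platform's cost across tenants in proportion to usage."""
    total_usage = sum(usage_by_tenant.values())
    if total_usage == 0:
        raise ValueError("no recorded usage to attribute against")
    return {
        tenant: total_cost * usage / total_usage
        for tenant, usage in usage_by_tenant.items()
    }


# Hypothetical: query-hours per tenant, taken from the platform's own logs.
usage = {"ads": 600, "search": 300, "growth": 100}
shares = attribute_cost(10_000.0, usage)
print(shares)
```

The hard part in practice is choosing the usage measure, since a platform with several cost drivers (compute, storage, transfer) may need a different split for each.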

Watson: How do you track horizontal consumption, such as when one team owns a service that's used by other users, and how can you forecast future costs? The question comes back to Pinterest. Of course, every company goes through a budgeting process, and wants to know how much each team wants to spend. How do you look at forecasting by each service team, if it's challenging?

Junck: We definitely do have some challenging situations. For example, I know Kubernetes was brought up earlier, and we don't have cost attribution for Kubernetes yet, but that's not even the worst of it. There's other multi-tenant platforms that are now building on top of Kubernetes, so we've got like a platform inception, and so estimating the cost for that can be really difficult. This is where it's a journey. We don't necessarily know how we're going to get there. We create budgets around now, actually, that are the best guess that we have for what we think that we're going to need next year. That's a combination of the service team's best guess of their usage trajectory, as well as the top-down, like how much finance wants to spend. Then throughout the year, we need to respond to new information and manage within those funds, or if something really comes up, of course, argue for additional funds.

Watson: Can business drivers be a factor in that? I know it's tough, because teams sometimes want to think they can estimate their cost, but it's challenging. Do you have future plans about how you tie growth drivers to capacity? I know that's a really hard problem.

Junck: Where we're trying to go is to become more predictable with our business drivers so that every service understands how they grow with some business driver. Then, ideally, we'd have really accurate projections of our business drivers, but things come up, like global pandemics, or who knows what that can throw a wrench into your plans. That's where we need to be nimble and respond to new information as it comes up.
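One basic way to tie a service's cost to a business driver is a straight-line fit of historical cost against the driver, then plugging in the projected driver value. This sketch assumes a single driver and a linear relationship, which the talk does not claim; the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: business driver (millions of users) vs. monthly cost ($k).
users_millions = [100, 200, 300]
monthly_cost_k = [50, 90, 130]

slope, intercept = fit_line(users_millions, monthly_cost_k)

# Forecast cost if the driver is projected to reach 400 million users.
forecast = slope * 400 + intercept
print(round(forecast, 2))
```

A forecast like this is only as good as the driver projection behind it, which is why the budget still has to be revisited as new information arrives.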

Watson: Do you use any machine learning to pick up patterns and mapping those to costs?

Junck: One of the things that we have found is that machine learning isn't a good fit for the number of services that we have, at the hundreds of services levels. It's still something that works better with human models, but anomaly detection is something that we definitely want to use more of where it's like, if suddenly there's a big spike in spend, or a big drop, we want to understand why and be able to respond to that, and make sure that's on purpose.
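A simple form of the spend anomaly detection described here is to compare each day's spend with a trailing window and flag large deviations in either direction. The sketch below is one such rule of thumb, not a description of any Pinterest system; the window size, threshold, and spend figures are assumptions.

```python
from statistics import mean, pstdev


def spend_anomalies(daily_spend, window=6, threshold=3.0):
    """Return indexes of days that deviate sharply from the trailing window."""
    flagged = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        center = mean(history)
        spread = pstdev(history)
        deviation = abs(daily_spend[i] - center)
        # With a perfectly flat history, any change at all is flagged.
        if (spread == 0 and deviation > 0) or (spread > 0 and deviation > threshold * spread):
            flagged.append(i)
    return flagged


# Hypothetical daily spend in $k, with a spike on the seventh day (index 6).
spend = [100, 102, 98, 101, 99, 100, 150, 100]
print(spend_anomalies(spend))
```

Because the check uses the absolute deviation, it catches big drops as well as big spikes, matching the goal of understanding any sudden change and confirming it was on purpose.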

Watson: On a scale of 1 to 10, how far away are you from migrating away from AWS and having your own data centers?

One to 10, I would say we're closer to 1 or 0. We do a lot of analysis around what it costs to run across multiple cloud vendors. We also do analysis of our capabilities as an engineering team. Imagine if you don't run any data centers today. I think I was talking to another company that has 40 or 80 network engineers to run their data centers; we don't have those engineers. If you think about cost analysis, it would probably be a similar inversion of the question to, say, Facebook, like when are you moving to the cloud? I know they're using more of the cloud. When you start on one, it's actually very expensive to go the other direction. If you do a cost model of what it takes to build the equivalent of Amazon S3 in terms of redundancy and geolocation, that's a very big number. Right now, we're relatively low on the scale. You probably saw in the news, we just re-signed our contract with Amazon. It's worked out really great for us from an innovation perspective. Through some independent third party analysis, oftentimes when people start on-prem or on the cloud and get very large, there's something called ingress and egress cost: running in a hybrid config for even a few years and shipping across petabytes of data can just wipe out any savings you'd expect.


Recorded at:

Jul 15, 2022

Molly Junck
