Panel: Living on the Edge

Summary

Jose Nino, Rita Kozlov and Ivan Ivanov discuss when we need to care about edge optimizations, what the development workflow looks like when on the edge, and some of the challenges.

Bio

Jose Nino is Staff Software Engineer @Lyft. Rita Kozlov is Product Manager @CloudflareDev. Ivan Ivanov is Engineering Manager on the CDN Reliability team @Netflix.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Fedorov: Welcome everyone to the discussion panel for the API acceleration and edge computing track. From today's presentations, we've learned a lot about architecting our systems to optimize data delivery, with solutions that bring compute and storage closer to the edge, route traffic effectively, or get the most out of the network protocol. We have three fantastic speakers who will share their experience of what it means to actually run systems that operate at global scale and leverage the edge. What are the use cases? What are the challenges? Where is the future? We'll focus on the topics of use cases, observability, challenges, and issues. Hopefully, we'll also share a bit of a forward-looking perspective.

Background, and Definition of Edge

Would you mind just giving a quick intro of who you are and what you do? What's the definition of edge? That's a fascinating term, and quite often there is no single definition. What does edge mean to you in terms of the context and domains that you work in?

Nino: I am Jose Nino. I'm a staff software engineer at Lyft, which is a ride sharing company in the United States. I've been working there close to five years now. The entire time I've worked on our networking infrastructure. For a little bit over half of my time at Lyft, I worked on the server networking team that built an open source Envoy proxy. During that time, we managed all of our network traffic all the way from the edge, to our service mesh, and even to our egress traffic. During my time on the server networking team, we defined the edge as the first ingress point into this infrastructure, at the moment where we leave the wild internet and go into our data center. It's the point where we start controlling that infrastructure. Over the last two years, I've been working on a new project out of Envoy called Envoy Mobile, where we're pushing the same networking stack that we have on our server infrastructure, out to Lyft's mobile clients. We're pushing the definition of what the edge is by deploying the same networking stack that we have in our servers to our mobile clients. My definition of edge is a little bit fluid at this point. I would push it to the point where the infrastructure that we control is closest to the end user. At this point for Lyft, that's the mobile clients.

Kozlov: My name is Rita. I lead the product team for Cloudflare Workers, which is Cloudflare's serverless platform that runs on our edge, which for us means basically a network of data centers spanning over 200 locations around the world.

I think, like everyone, I find the term edge really interesting because to me, it implies that there's a whole, and then something cuts off somewhere. The first time I heard that term, I thought, the edge of what? It's really interesting for me to hear Jose think of the edge as going as far as the client, because the way that we generally talk about it is in the dichotomy of client versus server, and the edge being the way that you can get as close as possible to the client without being on the client itself, just by being at a data center that's milliseconds away.

Ivanov: I'm an engineering manager on the CDN reliability team at Netflix. That's our content delivery network. Our team focuses primarily on resiliency, availability, and quality of experience, amongst other things. If I have to simplify what we do: we are the team who actually makes sure that if you open the Netflix UI and click on a title, and you want to play a movie or a TV show, it starts and finishes without issues and with the highest quality possible for you.

Edge, in my case, would be the content delivery network, all the servers that we have distributed around the world. When I think about it, it can be servers present in big internet exchange locations in Europe or North America or South America, as well as servers deeply embedded in our ISP partner networks. Think of it this way: we have servers in the Amazonian jungle, close to the equator. If you look to the south, we have servers in Tierra del Fuego, the island at the tip of Argentina, all the way down in South America. Then, if you look north, we have servers on the island of Svalbard, a Norwegian island which is roughly close to the North Pole, near Greenland in the Barents Sea. The edge is literally the set of infrastructure that we control and maintain, which is usually closest to our users.

Fedorov: Ivan, can you give an idea of the number of locations, like how wide is that edge?

Ivanov: I forgot to mention, it is thousands of locations, and many more thousands of servers across those thousands of locations. It's pretty broad.

Challenges of the Edge from an Operational and Development Perspective

Fedorov: It sounds like in our case, the edge is something that's as close to the users as possible while still under our control. Generally, it can run into hundreds, thousands, or even millions of points of presence if we go into the next evolution, as Jose mentioned about pushing the edge to the devices. When we think about that scale and that breadth, this is a different concept from running in a traditional data center, where you have a few regions and locations. From an operational and development perspective, what are the main challenges or issues that you experience? What's the first thing that comes to mind? What has to be different, or what is challenging, that you really have to focus on in your day-to-day work?

Ivanov: When we are talking about distributed infrastructure, you also have to think about cases where multiple dependencies are in the picture, when your applications and services have to communicate with other applications and services over the internet. That's a challenge. The other thing to consider is that wide distribution across multiple networks sometimes creates connectivity issues with those endpoints. When you're designing your applications, services, and infrastructure, you have to be aware of those challenges, keep them in mind, and make them part of the design from the start.

Kozlov: I completely agree with that. It's interesting because for us, the edge and the developers on the edge have two meanings: one set of developers are the developers at Cloudflare who build out the edge, make it resilient, and themselves ship products that are built on our network. For me, my end user is any developer out there who's interested in building their next application. I think the challenges of the edge have to be contextualized against the benefits of the edge. By contrast, another type of architecture that you can choose is one where you rely on a single data center. In that instance, the things that you're concerned about are latency, if your users are on the other side of the world, and disaster recovery and resiliency. Just by nature, the edge is much more resilient because it's designed for failure; you have basically infinite failover. Those are the benefits that you get.

What you trade for those things, for not having to think about regions and load balancing and how to really attain that resiliency and performance, is, as Ivan mentioned, thinking about state. Distributed systems get complicated when you're running in three or five data centers. How does all of that work when you have 200 data centers? That's definitely, I think, conceptually one of the most challenging things. We try to abstract that as much as possible so that users can design and think about their systems the same way they would if they were running in a single data center and still get all of those benefits. That's certainly the biggest challenge.

Complexity of the Edge with Billions of Locations

Fedorov: I wonder, Jose, in your extreme case, when you have millions, if not billions, of locations, what is the most complex part?

Nino: We think about it similarly to what Rita was saying: our end users are twofold. On one side, the people walking around the streets trying to get into cars, and the people who are in cars trying to get passengers. On the other side, the engineers who have to operate this pretty complicated, distributed system. In the former case, for the end users, what we want is a seamless experience. Part of that, with the work we've been doing to unify our client-server interaction, is the ability to innovate at the protocol level and at the network level, to squeeze performance out of that interaction. Similar to what the Facebook engineers were talking about with QUIC, we're now entering the phase of experimenting with the performance increases that we can see there.

Then on the operator side, on the engineer side, the thinking is: what tools can we provide to make the technology as consistent as possible, so that at every hop of the network they can see the same set of behavior, the same set of configuration, and the same set of observability? That way people, both in steady state and when problems occur, can quickly assess the health of our network. If problems occur, where are they occurring? Are they occurring all the way from the mobile clients out on the open internet because of carrier problems, or inside our data center, through the traditional server edge and into our internal services? A consistent set of those three dimensions that I mentioned really reduces the cognitive load and allows us to operate this pretty complicated system a little bit better.

Justifying Complexity and Investment in the Edge

Fedorov: I think so far, I'm hearing two patterns. One, doing things on the edge has complexity. You have to understand that it's way more than in a traditional data center, though there are ways around it. The other point that was mentioned is that it should be worth it: the investment of going to the edge, like changing engineering practices or operational practices, has to pay off. What can you share about the use cases? What are the killer applications that you're familiar with that actually justify all of this complexity, all the investment in edge infrastructure?

Kozlov: We've gotten to really see the evolution from the edge being considered a CDN and a relatively dumb mechanism for caching things, into allowing users to run code on it. At first, the use cases that we were seeing were an iteration on, how can I cache more things? Things like adding headers, or maybe doing some basic load balancing, or slightly less than basic load balancing based on more parameters. One of the first use cases that I thought was really interesting is authentication, because, in a way, it's also an iteration on caching: it allows you to cache things that previously weren't cacheable at all. What we see with many customers is a use case where they have a free service and a paid service. Caching free things is really easy because anyone has access to them. Caching paid things is really hard because you have to determine whether or not someone should have access, so you're creating a choke point at your origin, which is challenging for two reasons. One, you're delivering a worse service for your customers, because the content is getting pulled from a location far away; it's not cached. Two, it's a difficult engineering problem, because now you have to figure out how to scale that origin that's meant to serve millions of people. Being able to push the authentication up to the edge means that you can actually now start to cache these assets, take things another level further, and really demonstrate where the edge fits in between the client and the server.
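
To make the authentication example concrete, here is a minimal sketch of how the pattern might look as a Cloudflare Worker. The token check and URLs are hypothetical placeholders, not any particular customer's logic; a real deployment would verify a signed JWT or call an auth service.

```typescript
// Sketch: authorize at the edge, then serve the paid asset from the edge cache.
export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const token = request.headers.get("Authorization");
    if (!isValidToken(token)) {
      return new Response("Unauthorized", { status: 401 });
    }

    // Key the cache on the URL only, so one cached copy of the paid asset can
    // be shared by every authorized user.
    const cacheKey = new Request(request.url, { method: "GET" });
    const cache = caches.default;

    let response: Response | undefined = await cache.match(cacheKey);
    if (!response) {
      response = await fetch(cacheKey);                     // cache miss: go to the origin
      ctx.waitUntil(cache.put(cacheKey, response.clone())); // populate the edge cache
    }
    return response;
  },
};

// Placeholder check, only here to keep the sketch self-contained.
function isValidToken(token: string | null): boolean {
  return token !== null && token.startsWith("Bearer ");
}
```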

Another really great illustration is A/B testing, where, again, each side has its own drawbacks. With the client, you run into things like inconsistent experiences: if you're running an experiment, and you see a blue screen flash for a second and then it turns red, that's not great for the user. The other option you have is to block all the rendering and present the user with a white screen until you're able to render the whole experience. That's not great either, especially now that everyone cares about Core Web Vitals and time to interactivity, all of that. You can put things on the origin, but again, now you have to travel there every time, and you also have to maintain SDKs. Really, going back to the first question, that concept of the edge being as close to the user as possible while still under your control is perfect for this, because you can determine which experiments someone needs to be in, pre-render it, and just have it work out of the box for the user.
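
A similarly small sketch of the A/B testing case: the Worker assigns a variant once, remembers it in a cookie so the user gets a consistent experience, and fetches the matching pre-rendered page. The cookie name and variant path are made up for illustration.

```typescript
// Sketch: edge-side A/B assignment with a sticky cookie, so the user never
// sees a flash of the "wrong" variant and the origin serves pre-rendered pages.
export default {
  async fetch(request: Request): Promise<Response> {
    const cookie = request.headers.get("Cookie") ?? "";
    const variant =
      cookie.includes("experiment=B") ? "B" :
      cookie.includes("experiment=A") ? "A" :
      Math.random() < 0.5 ? "A" : "B";                 // first visit: assign 50/50

    // Route variant B to its pre-rendered version of the page.
    const url = new URL(request.url);
    if (variant === "B") {
      url.pathname = "/variant-b" + url.pathname;
    }

    const response = await fetch(url.toString(), request);
    const withCookie = new Response(response.body, response);
    withCookie.headers.append("Set-Cookie", `experiment=${variant}; Path=/; Max-Age=86400`);
    return withCookie;
  },
};
```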

Now more and more, we're seeing whole applications being built on the edge, and we really think that, over time, we'll see it becoming less of a niche thing for these types of use cases, in the same way that it used to be standard to build on-prem and now everything's in the cloud. The next stage of the cloud is the edge.

Fedorov: I think that's the part where I really liked the elegant presentation about Durable Objects: pushing data plus compute to the edge and making it more self-sufficient, rather than just an add-on to the data center infrastructure, and completely embracing that as the main architectural standpoint. It's always great to see what you folks are doing at Cloudflare in terms of pushing the boundaries.

The Killer Use Cases for the Edge

Ivan and Jose, what are your killer use cases for the edge?

Ivanov: For me, in our case, our application has one main purpose, one single intention: it was designed to deliver video in the most performant, most efficient manner. In our case, we have been optimizing the application to the point where we now have servers capable of delivering close to 100 gigabits per second from a single server. That was the starting point, our main goal for the content delivery network. Slowly but surely, we have extended that to the application layer, for cases where we have more personalized content, API calls, and things that are by definition non-cacheable. If you have seen Sergey's talk, he covers how we are optimizing dynamic calls from end clients using the edge infrastructure, offloading and optimizing the TCP and TLS handshakes, and going even further, using a purpose-built backbone to route requests over a more reliable network path from end users to the main data centers, versus using the internet. Those are very exciting cases: something that's not cacheable by default, you can still optimize. It seems better with that approach.

Nino: For us, it's a mix between the two: a very purpose-built solution, but also the flexibility that Rita was talking about. It's all built on this North Star of consistency, because we have a pretty small platform team for mobile networking, only four people. If we want to do things like what the Facebook engineers were talking about with QUIC, it took them a year to implement and two years to roll out. For a four-person team in a company much smaller than Facebook or Google, it's much more difficult to do experiments like that. Our idea with pushing the edge to the clients, and by that I mean deploying basically the same code base on iOS, on Android, and then on the server, is that we shave off both implementation time and rollout time, because we can reason about the code base just one time and roll out experiments.

This applies to protocols. It also applies to a use case from maybe six months ago, where we wanted to experiment with compression between the client and the server. Prior to Envoy on the server and Envoy Mobile on the client, we would have had to implement compression algorithms and deploy them to iOS, to Android, and to the server. For the experiments that we did six months ago, we wrote it once, obviously both the compression and the decompression side, but the code base was unified, and we rolled it out within a week and saw dramatic improvements in latency and success rate because of the payload decreases. For us, the use case is that this consistency enables us to even start thinking about other, more complex things we can do, because we've really reduced the amount of effort that it takes to implement new paradigms and deploy them.

Fedorov: That's actually a very good point, because, Jose, in your case you mentioned wanting to reduce the effort to roll out network changes to the clients, and you have a diverse set of client applications. At Netflix, in our case, we pushed things to the edge, and we decided we'll just keep the clients as they are, or make as few changes as possible, and leverage our edge that's very close to the users to have full control, end-to-end. I like your idea of pushing just that small piece of functionality, making it a component on the client, and having control there. In our case, we also have to deal with 10-year-old TVs, so it's not [inaudible 00:22:16] solution. That direction, this thought of enabling future innovation, I think is quite powerful.

Cases Where the Edge Shouldn't Be Used

It's always fun to hear about good use cases like that, especially from companies with big infrastructures and successful businesses that want to use the edge. What are the cases when we shouldn't use the edge? Are there any cases where it's just not worth the effort and complexity to invest in it?

Ivanov: In my view, there are at least two or three distinct cases when using the edge is just not worth it. It's not that you cannot do it, it's just not in any way better or preferable. Cases like that would be any type of non-latency-sensitive work, like batch processing, or any type of large compute operation, where you need machines, dedicated hardware, with large amounts of memory or a big number of CPUs capable of processing the data really fast. For those cases, distributing it and getting it to the edge is just not worth it. The other thing is any time you're working with critical data that needs to be strongly consistent, not eventually consistent. I tend to think of things related to finance or banking or even stock trading; those may require immediate consistency right away, versus the eventual consistency that you will, by its nature, get if you distribute your workloads to the edge. Those are the main things that I can think of.

Kozlov: I think it again goes back to how you think about the edge, because what we've learned is that, because we built out this network, anything that's on the network is the edge to us. You can also think about that network in different ways. Going back to what Ivan was saying about things like batch processing, or really compute-heavy tasks, we were originally in the same camp: those belong in a data center somewhere, not on the edge. The really interesting thing about an edge that's built out to serve content to users as close as possible to where they are is that it's bound to natural day cycles. When it's nighttime in the U.S., all of the compute that we have sitting in the U.S. is sitting there idle, and the same thing when it's nighttime in Asia and Australia or in Europe. What we found is that if you think about the edge not as what's closest to the user, but as where you have compute available, then all of a sudden it becomes feasible to run more computationally heavy tasks there as well.

The other thing is, I don't think that there's a wrong use case. Again, it's about thinking about the challenges and embracing them. It's always a tradeoff of how much you want to reap the benefits of the engineering work of moving things to the edge. The old adage in engineering is, if it ain't broke, why touch it? The other thing that we always get questions about is, is my legacy application suitable for the edge? Probably you can't just take it and move it. It's a question for your organization: is the engineering effort going to be worth it? What we have seen work quite well in those situations is setting up a proxy on the edge and slowly moving services over, one bit at a time. I would say that definitely not a good use case is taking what you have and trying to lift and shift it. That's just not a good practice in general, because those types of projects tend not to go as quickly as you want them to and stall for long periods of time. You just sometimes have to think differently about them.
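
One way to picture that gradual migration is a small routing Worker in front of everything: a few migrated paths go to the new edge-hosted service, and the rest fall through to the legacy origin. The hostnames and path prefixes below are hypothetical.

```typescript
// Sketch: strangler-style proxy at the edge. Migrated routes go to the new
// service; everything else still hits the legacy origin unchanged.
const MIGRATED_PREFIXES = ["/api/profile", "/api/search"];

export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    const migrated = MIGRATED_PREFIXES.some((prefix) => url.pathname.startsWith(prefix));

    url.hostname = migrated ? "new-service.example.com" : "legacy-origin.example.com";
    return fetch(url.toString(), request);
  },
};
```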

Fedorov: One thing that I just thought of as probably not a great use case: ultimately, the tradeoff is between the edge and the data center. The edge is smaller; you have many more units, but smaller ones. If an application requires a global view of state, if you need all the data in the same place, then passing all that data over the network to one of the edge locations is a little bit wasteful and doesn't make much sense. That probably falls much more into the batch processing use case that Ivan touched upon.

Jose, what else have you learned from the Lyft experience?

Nino: I think the summary point for me is starting with the question and seeing whether bringing complexity to the edge is the answer. As with many things in our conference and our panels, we had a pretty clear problem that we wanted to solve, and the solution was what we did, at least at the time. Rather than chasing technology, we would chase the answer to our problem and see what fit it. It happened that the answer brought more complexity to the edge.

Mitigating Complexity

Fedorov: Thanks for bringing complexity back into the conversation. Let's chat a little bit more about mitigating complexity, because no matter how great the use case is, it may require quite a bit more effort to operate the system. In many situations, even just running things in a few data centers, it's already difficult to figure out what's going on, when you have millions of other things to do in terms of operations and visibility. Jose, since you mentioned it, what have you learned?

Nino: I touched on this briefly before, with the North Star of consistency around configuration, behavior, and observability. What I would like to highlight now is observability. The example that I like to use is that when we put Envoy Mobile on the clients, what we got was a consistent set of time-series namespaces that we were already used to operating with in the data center; now we had them all the way from our mobile clients. Back to what I was saying about mitigating cognitive load for operators: if the time-series data that they get about the health of the network is the same, with literally the same namespace, all the way from the clients into our server edge, into our service mesh, and out into our data centers, then we get a pretty consistent view of what is happening with our request flow. That has really benefited operators, including myself. I operated Lyft's API gateway for a long time, and it was much easier to reason about because we had the same time-series data explaining the behavior of our network, end-to-end. To us, mitigating complexity comes from consistency, because it just reduces the amount of cognitive load that you have to go through when you're reasoning about these complex systems.

Fedorov: Would it be fair to say that you basically embraced and focused on that from the point when you proposed the design of the system?

Nino: Absolutely. It was really one of the main guiding posts that we had. We knew that we had to make it consistent if we were going to add complexity, because if we added complexity and made it inconsistent with all of our other behavior, it would just be too much.

Ivanov: Plus 1000 for uniformity. It's one of the things that lets us support and work on a really vast infrastructure with a very small team. We push and work hard to make sure that all of the servers we operate are configured in the same way, that they provide telemetry in a similar way, and that near-real-time instrumentation is done in a similar way, so we can reuse and build upon the same toolset for both observability and management of the fleet. The other thing is automation, especially when there are failure cases. It's the obvious thing: you do not want humans executing repeated tasks. You can go from semi-automated, where a human makes a decision and has the automated system execute, fix, or change something, to fully automated, where the system takes care of it, you don't need to be interrupted, and you don't have to do anything about it. Keeping it concise is the key there, as well.

Kozlov: From the standpoint of someone offering a platform for others to build on, the biggest complexity is cognitive: shifting the mindset and learning, basically, a new skill set in terms of the runtime and how to think about it. The biggest way for us to mitigate complexity, going back to what Ivan and Jose said, is providing the tools for you to have the observability and the automation that you need. Really, it's a lot of investment in developer experience at every single stage of the way, to spare users from having to invest in all of that themselves, so they can benefit from it out of the box.

A couple of years ago, we launched Wrangler, which is our CLI; it helps you build all of that into your CI system whenever you deploy. We try to show customers how to build logging into their Workers so that they're able to manage that as easily as possible. Where we're looking to go next is taking things a step further and helping users take advantage of what we learned running software on the edge for so long. At the end of the day, there are things you can control and there are things that you can't control. How do you allow developers to test their stuff in production, but at lower risk, by deploying things to a canary first, and really take advantage of those best practices?
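
As one small example of the kind of logging this refers to, a Worker can emit a structured log line per request, which can then be streamed with `wrangler tail` or forwarded to an external sink. The field names below are arbitrary, not a prescribed schema.

```typescript
// Sketch: per-request structured logging from a Worker that proxies to the origin.
export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const started = Date.now();
    const response = await fetch(request);            // pass the request through to the origin
    console.log(JSON.stringify({
      url: request.url,
      method: request.method,
      status: response.status,
      durationMs: Date.now() - started,
      colo: (request as any).cf?.colo,                // which edge data center handled the request
    }));
    return response;
  },
};
```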

Do Developers Need to Care about Where Their Code Is Executed?

Fedorov: Rita, that's an interesting point you mentioned about the developer experience. At the edge, at what point, if any, do developers have to care where their code is executed, like which location? In a typical data center world, you sort of know: I'm running things in AWS in the U.S., in region A, B, or C. At the edge, do I have to be aware, and if yes, in what cases?

Kozlov: Ideally, you wouldn't have to care. Actually, if you think about it, the inverse is arcane in and of itself, because the very first thing that you do in most cloud providers when you deploy something new is choose a region. When you get started on our platform, somewhat cleverly, we say the region you're deploying to is Earth. The idea is to take the thinking about which data center you're deploying to away from you. With things like Durable Objects, we again want to take that thinking away from you while enabling use cases where consistency is really important. Even though you're technically writing to a single data center, we can take the management of which data center off your hands: today, when you create a Durable Object, it gets created in the data center that the eyeball request was made from. We can even move that data around to where it's being most frequently accessed from, which is, at the end of the day, what you want. To use Lyft as an example, if I'm trying to get a ride, and my driver and I need information about our whereabouts, that doesn't really need to travel to a location in India if I'm in North America. It should all stay within the same data center, probably the San Jose data center would be my guess, if I'm connecting from San Francisco.
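
A hedged sketch of the Durable Objects pattern Kozlov describes, using the ride example: one object instance holds the authoritative state for a single ride, and both rider and driver requests are routed to that same instance. The class name, binding name (RIDE), and request shape are hypothetical.

```typescript
// Sketch: a Durable Object holding the state for one ride.
export class RideState {
  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    if (request.method === "POST") {
      const update = await request.json();
      await this.state.storage.put("location", update);        // strongly consistent write
      return new Response("ok");
    }
    const location = await this.state.storage.get("location"); // read the latest whereabouts
    return new Response(JSON.stringify(location ?? null), {
      headers: { "Content-Type": "application/json" },
    });
  }
}

// Worker entry point: route each ride's traffic to its own object instance.
export default {
  async fetch(request: Request, env: { RIDE: DurableObjectNamespace }): Promise<Response> {
    const rideId = new URL(request.url).searchParams.get("ride") ?? "demo";
    const stub = env.RIDE.get(env.RIDE.idFromName(rideId));
    return stub.fetch(request);
  },
};
```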

The one case in which you do have to think about it, and which we're encountering more and more, is data sovereignty. As every jurisdiction moves toward having its own data sovereignty laws, that's where we're seeing developers care about this. Things like GDPR; everyone has their own policy. That's where we want to make it easier for you to define the restrictions around where your data lives, rather than having to think about coordinating it yourself, if that makes sense.

Fedorov: That's actually a really good point: there are some cases, from a security and legal standpoint, that you still have to be aware of.

The Evolution of the Edge, Looking into the Future

I'd like to wrap up with a bit of a forward-looking perspective. In terms of edge evolution and edge technologies, what's the part that you're most excited about going forward?

Ivanov: There are two things that I'm really excited about and have been working on for some time now: the correlation between near-real-time telemetry and big data, and using all the advantages that we have now to make our platform better, maybe even introducing things like machine learning or artificial intelligence, specifically to support the infrastructure on our behalf.

Nino: I think for us, the work that we've done over the last two years on my team to build and deploy Envoy Mobile to our clients is now coming to fruition on the table stakes, which is having client and server run the same code base. Really, our imagination is the limit, because we've reduced the amount of effort it takes to build, test, and deploy new network protocols or new compression algorithms. Now we're moving on to a phase where we can really start acting on that. We can go and hopefully deploy QUIC faster than Facebook did, and enable others at a medium size to do that, just by reducing both the cognitive load to operate and the engineering work to develop and deploy.

Kozlov: Obviously, there's a massive infrastructure team at Netflix, and Jose, it sounds like your team is doing all this work to get this technology to Lyft. What I'm really excited about is making all of this available to any developer starting out anywhere, and having this enterprise grade infrastructure and architecture be available to anyone who just might build the next Netflix or Lyft. Not to introduce competition.

Fedorov: It's a really good point, to democratize access to edge technologies for pretty much everyone: every company, every developer.

 


Recorded at: Mar 25, 2022
