
Create from Anywhere: the Netflix Workstations Story


Summary

Michelle Brenner discusses the studio Netflix has been building for their originals, the technology behind it and the challenges faced.

Bio

Michelle Brenner is a Senior Software Engineer with 10 years of experience in tech from engineering support to manager. She runs an interview format tech podcast called From the Source that examines what tech jobs are really like.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Brenner: I'm Michelle, and my pronouns are she, her. I will be telling you what I've been working on at Netflix since I started 18 months ago. First of all, this is not what I do. I cannot cancel or renew your favorite show. I don't check if you're sharing your password. I definitely do not carefully curate your recommendations, so you never go to bed on time. What I actually do is help artists make content. Artists are integral to creating the Netflix originals we all love. My team is responsible for helping to make sure they can work every day, all day, and not worry about the infrastructure powering their workstation.

Outline

Here are the questions I will be answering. What is a Netflix workstation? Why do we need them? Who are they for? What are their needs? How did we make them? Finally, coming soon, Netflix Workstations Expanded Universe.

What a Netflix Workstation Is

What is a Netflix workstation? Netflix workstations are remote workstations designed for the artists and engineers who create the visual effects and animations for Netflix originals. They need specialized hardware, secure access to petabytes of images, and various industry standard digital content creation applications. Netflix workstations are not remote workstations for everyone. Is it possible to acquire a workstation, install Steam, and start playing Among Us? Sure. Have I done this? Maybe. It's not the actual use case.

Why Do We Need a Netflix Workstation?

Why do we need them? Historically, artists have had machines built for them at their desks and only had access to the data and applications when they were in the office. This system allowed for fully on-prem solutions that stopped at the door. However, something came up that made working from the office no longer the solution. What could that be? I bet you all laughed sadly, and said COVID. That was a trick question. Everyone working from home put our project into overdrive. We actually started in 2019. Why? It's because Netflix wants to make all of the movies. Not all the movies, but movies at an unprecedented scale. I've worked in entertainment technology almost my entire career, and never on this many projects. Someone astutely asked, are there enough people in Los Angeles to make all of these projects? The answer is, of course, no. To make all the movies, we need to hire all the talent, wherever they are. Not everyone will be able to or want to come into an office, even as people start to trickle back in, in some regions. It's about taking down traditional barriers and making Netflix an accessible place to work. We don't want to lose access to great talent because they live in Mumbai or have mobility restrictions.

Why Not Ship A Computer?

Why not ship a computer? Our North Star for Netflix workstations is to provide the infrastructure for artists to get a one-click experience: you go from sitting down to working on a shot. That can also be accomplished by building a computer and shipping it to them. Why don't we do that? A couple of reasons. First of all, flexibility. Film and television production needs can change fast: new cameras, new software. It's an ever-evolving space. By creating cloud-based workstations, changes from features to security patches can be added quickly. A user can switch from one operating system to another by grabbing a new window instead of a new computer. Workstations, like many cloud-based applications, are designed to be ephemeral. Artists are encouraged to pick up a fresh workstation and not worry about an old one going out of date. Of course, this has its own challenges, like changes interrupting workflows. The benefits are worth it. Another reason is security. Asset security is critical in entertainment. No one wants their unfinished work on Pirate Bay. By having the workstation and the assets in the cloud, the files stay self-contained. There is no downloading to a local computer. It also means being able to automatically back up and control any assets created on the workstation.

Who Can Benefit from a Netflix Workstation?

Who are Netflix workstations for? I mentioned the artists, but they're not our only users. Three months into this project, in March 2020, we got the mandate to get the workstations into the hands of artists as fast as possible. This led to a white-glove service. We spent a lot of engineering time on both customizing and cutting corners, which does not create the best experience. One of the things we learned is that we needed to shift towards a platform, instead of just a product, to scale to the many different artist workflows. We needed to enable other engineers to build on top of a rock-solid infrastructure to have that exponential impact.

How Did We Make Netflix Workstations?

If you've been listening to this, and going, but Michelle, how did you do this? Now is your time. We start on the left with an empty instance, and slowly add features until it is a Netflix workstation.

Why this, not that? Do less, accomplish more, that is, only write the code you have to and focus on solving the user problems. Lucky for me, one of the common design philosophies that you'll hear a lot in the corridors, or really the Slack channels, of Netflix is the paved path. The paved path is the set of tools and practices that are widely adopted and supported. At Netflix, as with a personal project, you have the freedom to solve problems any way you want. However, it will be much easier for you if you can use the paved path. I mention this so that when you ask me, why didn't you use x technology or y technology? Most of the time that will be my answer. Why make things harder for myself than I have to? Of course, sometimes you have to clear a new road and carefully weigh all the options. Strong community adoption weighs heavily with me; otherwise, I'll never get any of my questions answered on Stack Overflow.

Configuration: The Machine

Let's talk configuration. Spinnaker describes itself as an open source multi-cloud continuous delivery platform that helps you release software changes with high velocity and confidence. It is a common Netflix tool that we use for releasing and maintaining services. It is on the paved path. We also use it to control the creation of workstation pools. Workstation pools are groups of Netflix workstations with the same configuration. They're similar to an autoscaling group in AWS, but we have a custom service to control the scale. Configurations can vary widely, but across a limited set of variables. For example, artists could need a GPU when doing graphics-intensive work, or extra-large storage to handle file management. Some artists need CentOS to support their compositing software, while others require Windows to use their pre-visualization software. The workstations need to be as close as possible to the artists to minimize lag. We support a growing list of regions and zones.

Spinnaker uses pipelines as the instructions for creating pools. An API, in conjunction with variables in the pipeline, creates workstation pools programmatically. Initially, we created big pools of workstations that only had the OS and a few internal tools. When the artist requested a workstation, all software was installed just in time. That led to long wait times and unhappy artists. Most artists were requesting a handful of standard configurations and did not need that maximum flexibility. Instead, we created a service to take the most popular configurations and preemptively create pools for them. Now, artists can get workstations in seconds in most scenarios. However, it is also somewhat inefficient, in that sometimes workstations are sitting around lonely and unused. One of the improvements we're working on is using images more frequently, creating a large library of configurations in order to be able to spin up pools faster. I'll get into that more in the next section, software configuration.
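To make the pool-creation flow concrete, here is a rough, hypothetical sketch of triggering a parameterized Spinnaker pipeline through its Gate API from Python. The Gate URL, application name, pipeline name, and parameter fields are all illustrative assumptions, not the actual Netflix setup.

# Illustrative sketch only: kicking off a parameterized Spinnaker pipeline that
# creates a workstation pool. The Gate URL, application name, pipeline name, and
# parameter fields below are hypothetical placeholders, not Netflix's real setup.
import requests

GATE_URL = "https://spinnaker-gate.example.com"   # hypothetical Gate endpoint
APPLICATION = "workstations"                      # hypothetical Spinnaker application
PIPELINE = "create-workstation-pool"              # hypothetical pipeline name

def create_pool(os_name: str, gpu: bool, region: str, size: int) -> None:
    """Start a pipeline run that creates a pre-warmed pool with this configuration."""
    trigger = {
        "type": "manual",
        "parameters": {        # pipeline parameters drive the pool configuration
            "os": os_name,     # e.g. "centos7" or "windows"
            "gpu": gpu,        # graphics-intensive work needs GPU instances
            "region": region,  # keep workstations close to artists to minimize lag
            "poolSize": size,  # how many instances to keep warm
        },
    }
    resp = requests.post(f"{GATE_URL}/pipelines/{APPLICATION}/{PIPELINE}", json=trigger)
    resp.raise_for_status()

# A popular configuration gets a pre-warmed pool so artists can get a workstation in seconds.
create_pool("centos7", gpu=True, region="us-west-2", size=10)

A pre-warming service could call something like this for each of the most popular configurations, so pools are already sitting warm when artists ask for a workstation.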

Configuration: The Software

Today, there are over 100 different packages that can configure a workstation, from installing software to editing the registry. How did we get here? We needed a system that can manage hundreds, and one day thousands, of workstations. It needed to be extremely flexible, while still being easy to jump in and create new packages. That is where SaltStack comes in. We use Salt to make operating system agnostic, declarative statements about how to configure a workstation. It has many built-in modules, from installing a package to file management. It also allows for logic statements to handle situations such as, mount the storage in this environment only, or only run this script if this file does not already exist. This Salt formula example is the equivalent of running a yum install in the terminal. This module is actually OS agnostic, and it should find the right installer, even though I gave the CentOS example.
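The formula shown on the slide is not reproduced in the transcript, but the same OS-agnostic install can be sketched through Salt's Python client (Salt itself is Python-based). This is only a hypothetical illustration: the minion target pattern and the package name are made up, and it assumes a salt-master with connected workstation minions.

# Rough sketch of the OS-agnostic install described above, driven through Salt's
# Python API. Requires a running salt-master and minions; the target pattern and
# package name are hypothetical examples, not the actual Netflix formula.
import salt.client

local = salt.client.LocalClient()

# Equivalent of a formula declaring `ffmpeg: pkg.installed`, and of running
# `yum install ffmpeg` on CentOS: the pkg module picks the right installer per OS.
result = local.cmd("workstation-*", "state.single", ["pkg.installed", "ffmpeg"])
print(result)

A declarative formula doing the same thing would simply state that the package must be installed, and Salt's pkg module would pick yum on CentOS or the appropriate installer on another OS.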

Why Salt instead of Ansible, or other configuration management tools? A few reasons. It was used recently in other Netflix projects and got high marks. Also, Salt is Python based. Python is the standard language for VFX and animation tools. The production engineers are the ones I mentioned, who will be building the pipelines between workstations and artists. We want to meet them where they're at as much as possible. Salt is designed for maximum flexibility. It is great when you want to be able to do pretty much anything to a pool of instances at any time. It is bad when you can do pretty much anything to a pool of instances at any time. As I mentioned in the Spinnaker portion, we're moving towards an image model instead of an install-at-start model. Salt will be used as a provisioner to make the images instead of running on the instance as it's spun up. There will always be some user-based configuration, but anything involving a general installation can be done on the image instead.

Lifecycle

Now that we have a pool of workstations, we need to track the lifecycle. We need to know how many are available, how they're configured, and if they're in trouble. That's where the control plane comes in. We use a Java Spring Boot service with a Go agent on the workstation. Java and Spring have a strong paved path; most of the services for the team use that. Go makes it easy to cross-build executables for different platforms, and we knew early on we would be on multiple operating systems. The agent provides the workstation heartbeat, while the control plane is the source of truth for all the workstations, providing endpoints both for our internal team services and for the other teams at Netflix.
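As a rough illustration of the agent's heartbeat role, here is a minimal sketch of the loop. The real agent is written in Go; this is shown in Python only to keep the examples in one language, and the control-plane URL, payload fields, and interval are assumptions rather than the actual protocol.

# Minimal sketch of a workstation heartbeat loop. The real Netflix agent is written
# in Go; this Python version just illustrates the idea. The control-plane URL,
# payload fields, and 30-second interval are assumptions, not the actual protocol.
import socket
import time

import requests

CONTROL_PLANE_URL = "https://workstation-control-plane.example.com/heartbeat"  # hypothetical
INTERVAL_SECONDS = 30

def send_heartbeat() -> None:
    payload = {
        "hostname": socket.gethostname(),  # which workstation is reporting in
        "status": "healthy",               # a real agent would also check disk, sessions, etc.
        "timestamp": int(time.time()),
    }
    requests.post(CONTROL_PLANE_URL, json=payload, timeout=5).raise_for_status()

if __name__ == "__main__":
    while True:
        try:
            send_heartbeat()
        except requests.RequestException:
            # Missed heartbeats let the control plane mark this workstation as in trouble.
            pass
        time.sleep(INTERVAL_SECONDS)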

Acquiring a Workstation

The product part of a Netflix workstation is accessed through the UI. It includes the ability for artists to directly acquire a workstation. If you've seen other Netflix presentations on internal tools and are thinking that it looks familiar, you're right. The UI team at Netflix provided a bunch of reusable components for us, so we can focus less on the design and more on the business logic. Good old paved path again. The UI is also where we've put some key components for our move towards self-service. It includes giving artist team leads the ability to create configurations and control access to them. It is also for operators to visually check on the health and usage of workstations, things like checking logs when a support request comes in, or making sure an artist is using the correct configuration. We also have some overall dashboards to check on things like, are we out of licenses, or did the autoscaler go haywire and make a ton of workstations with Steam instead of compositing software?

Accessing a Workstation

After an artist acquires a workstation in the UI, or through the API, how do they actually get to it? There are two remote display options we currently support: NICE DCV and Teradici. The artist can open a workstation in a browser or a native client on their desktop. They can also use application streaming to simplify the experience to a single artist tool. This has been popular with some of our partners to give a streamlined experience from path selection to having the asset open and ready for drawing.

What's Next for Netflix Workstations?

Now that you know what we have already done, here's what's next for Netflix workstations. We want to continue moving from a full service product to a self-service one. The fewer calls I get to do things, the better. We want to make configurations clearer and faster, continuing with the work of moving to images instead of just-in-time installations. Finally, there's what I call the Netflix Workstations Extended Universe. Like the Christmas Prince Universe, crossovers and spinoffs just make things more fun. Workstations provide one part of the infrastructure for an artist's workflow. Providing easier integration points makes it possible to connect to the other points of the workflow from task management at the beginning, to the creation of final frames at the end.

What I Learned from Netflix Workstations

I want to talk about what I learned in the past year. When this project started, everything was new. I was new to Netflix, on a brand new team with an empty repository. I've learned so much in the past year, and I wanted to share something. All of these might sound obvious, but there can be a wide gap between intention and implementation. "Move fast, but don't break things" sounds impossible, so maybe just get as close as you can. Our timeframe got compressed because of external issues, but it still isn't fun when someone's first experience with a product isn't delightful. Observability and testing can seem like things you can cut when you're in a rush, but they come back and haunt you so quickly. I am excited about our recent commitment to improving ours, and the automated configuration testing we've set up could be an entire talk on its own.

Empowering others to develop with you gives you way more leverage than you could get on your own. When there's a lot to do, it can seem like you need to put your head down and power through, but the more you build in a way that makes it easy for others to contribute, the more adoption and help you'll get. Help other engineers help you. One thing at a time. This is useful from the micro to the macro level, from small tickets and small pull requests to small releases. If you try to abstract too early and do everything at once, you will never get it right. Doing one thing, validating, and iterating is a way to know your system is robust and on the right track.

Questions and Answers

Montgomery: I can't think of this being any more appropriate for the track, because we tend to think so much in terms of performance, or how to get something out of something, but we very rarely think about things like, let's put an artist in front of this who is used to a certain type of workflow. How do we let them do the things that they need to do with their tools, but do it in such a way that they can do it from anywhere? It's a fascinating thing.

Relying on remote workstations forces you to be connected; don't you think that would be a problem for end users? What if they want to go outside and work from a park?

Brenner: They're probably not working from a park, just because of the lighting. Most artists need very dark rooms in order to do their work, so it helps me a little that they usually don't need to go to a park. They have to stay connected to the internet. They need pretty good speed. You have to work somewhere with the internet.

Montgomery: Is there a similar story for developer workstations at Netflix?

Brenner: We've been playing around with it. There's this role called TDs, in visual effects and animation, and they're between the artists and developers. They're doing some development on the workstation. There are people trying it out. It's not like the whole company is doing it, because our focus is really on the people making the content. We're focused on them, but there are people who are developing on the workstations for the artists, so pipelines and tools and scripts and things like that.

Montgomery: I do have a little bit of experience working with some artists in the game industry. I learned a couple of things. One, I'm not an artist in any way, shape, or form to the level that a lot of those are. I noticed that the tool chains in the workflow for a lot of that are incredibly intensive on a couple of things. The performance angle comes in, it just doesn't come in in the same way. There are lots of things that, in the gaming industry, for example, artists are optimizing for all the time with how they're doing the art. I assume it's somewhat the same. I'm sure it's very different in terms of what you can give up a little bit here to gain a little bit there. That equation, I'm sure, changes. Doing this, what specialized hardware do the artists need that you mentioned? I can guess a little bit, but I just wanted to hear.

Brenner: You really need good GPUs for all the graphics-intensive work. You need it to be as close to the user as possible, because latency is obviously the killer of remote workstations. If it takes an extra second for your Wacom to interact with the remote workstation, you're just going to get mad and throw your laptop out the window. Then there's the ability to change quickly. You could be someone who's working on really intensive work, or sometimes you need a huge hard drive to move files around and download giant reference images, and things like that. It's just about being very flexible on what you can get on a day-to-day basis, and not just building the biggest, the best, the newest computer once, shipping it to you, and then just having it. It's about being flexible.

Montgomery: By pushing a lot of the infrastructure that would normally be right here in front of you to the cloud, and making the client really thin, did you have to give up anything that the artists relied on? Latency, I can definitely understand. I'm just curious if there were any compromises, changes to workflow that had to be done?

Brenner: The biggest mindset change is that people are used to having computers as pets. You have your laptop, you make changes to it. You assume those changes aren't going to go away. Our mindset was that these machines are ephemeral; you're going to get a new one as often as you can, so you get the latest software and the latest security. The idea is that we're not changing your workstation, we're just going to give you another one that's exactly the same as the other one. Figuring out how to make that transition as smooth as possible matters, because if you have to wait 20 minutes every morning to get a new workstation to get your work started, that's annoying. You're used to just opening your laptop, or your desktop, and being able to work. That, I think, has been the biggest change, because cloud hardware can go away at any time. It's why you have a server group with 20 instances in it, because if one goes down, it's ok, because there are 19 other ones. It's just a matter of changing that mindset and making that transition as smooth as possible, because you definitely give up the sense that this is always going to be there because it's sitting in front of you.

Montgomery: What experiences do you offer, Windows or Mac? What OSs have you seen be the most popular?

Brenner: CentOS 7 is very popular. I think that surprises people. It's because a lot of the digital content creation apps are based on it, so we're following the industry standards and what people are used to. It's very common in VFX and animation to go from studio to studio because of different projects. We try to use the off-the-shelf tools, so it's easier to onboard people. A lot of those are CentOS based. Then also Windows. We're using CentOS and Windows.

Montgomery: That is fascinating. It's very interesting. Windows is pretty much the main thing that I saw working with some gaming studios. It's very interesting that CentOS is up there. It totally makes sense, but it's just fascinating.

How is Netflix helping people handle bad quarantine health habits? Is it hard to get people to exercise, or things like that?

Brenner: That's outside of my purview. I'm all about helping them work, not about helping them take breaks. Hopefully they can. It's so easy to work, that you get work done early, and you can take a break.

Montgomery: How long does it take for a workstation, the hardware, software, and operating system, to get validated entirely? In my experience, it has been an uphill struggle to get something bespoke or the latest and greatest, and I wondered whether it would be different in another organization. How do you tackle this?

Brenner: For us, we can get it out very quickly with Salt. You can install something on someone's workstation in minutes, if you really need to. We try to have some gatekeeping in terms of testing things out before it gets to them. We might need to buy licenses or something like that. There is a time delay in terms of getting it right. If you need to get something right now, someone's like, "I just really want to try out the latest version of the software. I promise I won't complain if it's broken. I want to try it out." With that we can react very quickly, which is nice with Salt. For the day-to-day, that's where we lean a lot on our artist team leads and the TDs. We'll say, "We have these new Salt formulas that are the instructions for installation. Try it out, get back to us. Go back and forth until you feel good about it. Then you upgrade it." It's very similar to the dev, testing, production pipeline where you're trying to stage things so that things go out well instead of messy. If you really need something, you can get it out very fast. That's one of the nice things about the cloud. Like, you need a different type of instance? Great, let me just fill one out for you.

Montgomery: Is it pretty common, though, for a lot of the artists to try out different things, or are they fairly set in their go-to A, B, C, that's the process? Or is it all over the place?

Brenner: It really depends. I think it's really about their role and what team they're on. They tend to have very specific things. Often, on a project they'll want to lock in. Once they start working on it they won't want to upgrade that much. What they will upgrade is the new tools their TDs create for them that we'll put out, so custom code, things like that. They won't change their DCC software. They're not going to change from Photoshop to something else in the middle of a project. They will change scripts and tools that they're using as they work on the show. It really depends. A lot of it is, what is your role, and here are the standard tools that you would use for that role, and locking in for the project what version they want and what customizations they want. It's being flexible: whenever they need something, making it available through that self-service so they can get whatever they need.

Montgomery: Regarding security, do you allow people to plug in a USB key? Could work in progress be downloaded?

Brenner: We don't want you to download our images to your computer, please don't do that. I don't think it's possible. I haven't actually tried it. What I mentioned in the talk is that security of keeping everything very self-contained in the cloud. You're not downloading images to your computer. You're just working entirely on there. The USB key hasn't come up as a use case. I'm going to say no, since we haven't tried it.

Montgomery: How do workstations handle things like drawing tablets or screens?

Brenner: You can use a drawing tablet. You can use two monitors. A lot of that is handled through the remote display protocols. We support Teradici and NICE DCV. For that, they handle a lot of it. For us, it's just a matter of finding the right settings, because there's a lot of settings in there, and then the KDE, and Windows settings, and just tweaking them until there's a better artist experience. Yes, that handles that for us.

Montgomery: What about different types of artists? You mentioned artists who are drawing or doing images, things like that; is there also sound and things as well?

Brenner: We have editors, so I'm sure there is sound involved in that. As for something specific like audio production, I don't know if we have any of those. I do know we have editors, so I know they're using sound. I don't think I've specifically gotten any questions related to that. It must be working, since no one has complained about it.

Montgomery: You can always think that with that way.

Brenner: I like to believe that they would have said something if they couldn't edit correctly.

Montgomery: If you get a new workstation every day, how do you assist with personalization, such as custom shortcuts, UI themes, or just window layouts?

Brenner: The user settings carry over for your workstation. The last step in getting a workstation is getting all the user-specific ones. All the software installations and all the generic stuff happens as early as possible. Then when you say, I want a workstation, it says, where are all of Michelle's settings? Ok, I got them. Then it plugs those in. The idea is that you come in, and it looks exactly the same way it looked the day before.

Montgomery: I assume that also covers things like locations for files and being able to share those, and any customizations for scripts. I'm sure there are lots of scripts involved in the workflow process. That represents a huge amount of metadata for a user that we usually don't think about, but I assume that you've had to handle all that.

Brenner: Yes, just making sure that that moves from workstation to workstation was really important. Because no one wants to start fresh with a brand new computer every day. They really want to have it the way they had it before, so a big part of setting this up was to make sure those settings move over every time. It's a delight in Windows.

Montgomery: What about the security angle, what can they install and can't? How much control do the users have, and what they can do or not?

Brenner: They don't have the admin access to install things. TDs and tech leads have the ability to do a little more. We have this whole role-based access control that controls what people can do. We really don't want people to Wild West install whatever they want. We want to control the system so that we have the chance to test things, and not just be told, "I installed this thing, and it doesn't work," when I've never heard of that software before and I don't know if it works. There's also still the freedom to do things, like look at reference images and things like that. It's not completely locked down, but we don't want people changing too much and running into problems. It's like semi-controlled.

Montgomery: Who owns the workstations, Netflix or the artists?

Brenner: Netflix. It's our virtual workstations, we'll provide it for them as they work.

Montgomery: Just like I'm sure the assets belong to Netflix, or the studio, or whoever is involved in the legal process.

Brenner: The same way these laptops belong to Netflix.

Montgomery: How large is your team? You mentioned a lot of tools, I think you said something like over 100 or more different tools. It covers a wide range. I'd just be interested in how big the team is, and how things are split up.

Brenner: We started with four people, and then we grew really big. Then the team was too big. Then we split. Then we split again. Part of it is that we rely on a lot of the other teams at Netflix to help us. We'll do part of it, but then rely on them for core components that are reused throughout the Netflix ecosystem, so like a Netflix platform team, and things like security and networking and stuff like that. We try to reuse components as much as we can and rely on other teams who have an expertise in BaseOS, or things like that, while also trying to make it easier for lots of people to be able to build these tools. Each one of those tools is a little YAML file. The idea is that we create the system, and want people, even if their day-to-day is engineering, to be able to go in and maybe write that YAML file for themselves. We're transitioning from us writing all the initial tools, and getting the initial artists, to everyone being able to contribute, and be able to control what they put on their workstations for their team in any way.

Montgomery: Where do you see workstations going over the next one to three years?

Brenner: Workstations everywhere. More locations, as more people start working from different places. More different types of artists using it, so running into more interesting use cases, I think. Then also, as I mentioned, the self-service and developing on top of it, so moving along that spectrum from a product to a platform. Leaning very heavily toward the platform side so more teams can customize more, and build whatever workflows they want on top of it, instead of everyone having to use the same workflow that we have.

Montgomery: Can artists still get custom physical hardware machines if they want?

Brenner: I'm not sure. I believe that's in the purview of their team lead and whether they've decided to use workstations or not.

 


 

Recorded at:

Mar 18, 2022
