On a Deep Journey towards Five Nines



Aashish Sheshadri discusses how PayPal applies Seq2Seq networks to forecasting CPU and memory metrics at scale.


Aashish Sheshadri is a research engineer at PayPal, where he currently ideates and applies deep learning to new avenues and actively contributes to the Jupyter ecosystem and the SEIF Project. He holds an MS in computer science from the University of Texas at Austin, where his research focused on active learning with human-in-the-loop systems.

About the conference: a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.


Sheshadri: Thanks, everyone, for making time for this session. Today, I'd like to share some of the work we're doing around modeling time series data from an alerting and monitoring standpoint for PayPal, and in doing so, I also want to shed light on the work we're doing around some of the tooling, and infrastructure pieces to enable our data scientists, analysts, developers to build reliable and reproducible machine learning pipelines, because I think that bit about everybody being able to build these models reliably is important as well when it comes to using this in practice at the enterprise level.

To that end, the agenda for today's talk is a brief introduction to machine learning at PayPal. Since I'll be grounding most of the talk around monitoring, I'll introduce what monitoring looks like and some of the challenges we're trying to solve. Then, we'll switch to how we can apply deep learning methods to do forecasting. Then, we'll shift gears into some of the platform and tooling pieces: using Jupyter Notebooks and attaching them to HPC to enable training at the enterprise level. Then, I'll quickly go into a demo to show how all of this works together, and we'll wrap up with a conclusion.

Machine Learning at PayPal

At PayPal, we've been practicing machine learning, applying statistical models and data science principles, for over 20 years, especially in the risk modeling domain; that's been our workhorse. We build machine learning models at different levels of interaction a user might have in the PayPal ecosystem. Here are a few examples. Some of them are pretty obvious, like first-party fraud and seller mal-intent. Some of these can be new, like delayed settlement. While risk modeling has been our state-of-the-art, go-to area, we are slowly expanding the application of machine learning to other avenues. One area where we are seriously looking at applying machine learning is monitoring and alerting.

This is an actual view of our command center at PayPal. The sole goal of the command center is to make sure that PayPal is up and running 24/7 and is available worldwide. That translates to having to monitor 350,000 signals every second. That's right, 350,000 signals a second is the number of signals that we have to monitor. We have a trump card, we have an exact replica of the "Star Trek" command seat in the command center, but it's not quite enough, because monitoring this many signals is not easy. We clearly need some automation, which we have in place, and some form of intelligence to automate more of this monitoring.

To give a sense of what some of these signals look like, let me show you a few. Here is a signal we are looking at, it's signal count monitoring, and it's the same signal from two different pools. In green, we have the signal from pool two, and red is pool one. Even though it's the same signal, it's behaving differently in different pools. If I had to build a single model for the signal, that's not possible. I have to hand-tune it for different pools, and the pool characteristics can change over time.

Here is another signal; we're looking at latency monitoring here. In this signal, if I had to figure out what an outlier would be, there are many possible candidates. I can do some form of thresholding to drive alerting in health monitoring flows, but this signal is really part of this overall bigger signal. It was part of this two-month period. Now, if I come back to that question of what an outlier might be, it looks so different when I change the scale. Moreover, we are seeing obvious things like seasonality that need to be modeled in the signal, and code pushes can cause changes in behavior for a signal. There are many of these inherent properties that time series can have that can be modeled with machine learning, and that's what we're trying to approach and get better at doing.

Time Series Forecasting

Forecasting can help with a lot of these things. Forecasting is a pretty old problem, it's a hard problem, and we've seen earlier talks today talking about forecasting. In the past, at least most times, simpler statistical models like ARIMA would do much better than neural networks, but today the state of the art, and mainly the advances we've made in the ability to process and compute, are helping us use some of these deep learning methods to do time series forecasting.

Here is a simple sine wave. If we're using a neural network and we want to predict the sine wave, it should be possible. If I have to predict this, then I'll have some sample points between zero and two pi. I'll train the machine learning model, our neural network, on these samples, and then ask for inference on these other points.

It's going to work well because neural networks are universal function approximators, so they're supposed to learn any function, but really, it learns a function between zero and two pi, so it can be any Taylor series approximation function. When I try to start looking into future time steps, because the sine wave literally goes from minus infinity to infinity, and I try to infer these new points, it's going to fail. It's going to fail because it has not seen data from those time steps, it does not know what they even mean, so extrapolation is going to be a problem.
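The failure outside the training range can be illustrated without training anything. The sketch below is an assumption-laden stand-in, not the network from the talk: it uses a truncated Taylor polynomial of the sine (expanded around pi, degree 13, both choices arbitrary) as an example of an approximator that is accurate on the interval from zero to two pi and nowhere else.

```python
import math

def taylor_sine(x, center=math.pi, degree=13):
    """Truncated Taylor expansion of sin(x) around `center`."""
    u = x - center
    # The n-th derivative of sin at c is sin(c + n * pi / 2).
    return sum(
        math.sin(center + n * math.pi / 2) * u**n / math.factorial(n)
        for n in range(degree + 1)
    )

# Error on a grid inside the "training" interval [0, 2*pi].
inside = [2 * math.pi * i / 100 for i in range(101)]
max_err_inside = max(abs(taylor_sine(x) - math.sin(x)) for x in inside)

# Error one full period beyond the interval.
err_outside = abs(taylor_sine(4 * math.pi) - math.sin(4 * math.pi))

print(f"max error on [0, 2*pi]: {max_err_inside:.2e}")
print(f"error at 4*pi:          {err_outside:.2e}")
```

The approximation is tight where it was built and diverges quickly one period later, which is the same behavior described for a network that only ever saw samples between zero and two pi.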

Recurrent neural networks solve this because they start modeling sequences. Error is propagated over previous time steps, and because error is getting propagated over the sequence, we can model dependencies between samples, which makes approaching the sine wave much easier.

With the RNN, we have samples that go more like this, it's a sequence. In this sequence, the network has the opportunity to learn about inflection points, and it has the opportunity to learn about dependencies between samples. When I try to get an inference, it's going to work between zero and two pi, and it's going to work further ahead as well. With the sequence, the thing that I talked about is error propagation moving back in time. How far back in the signal are the dependencies that we're looking to model? In the sine wave, it's pretty small; as long as we have it within half a period, we can model some of these dependencies. For different signals, maybe the dependencies are further away. Gradients are generally multiplied back through, and when you multiply small things, the result gets smaller and smaller, and when you multiply larger things, it explodes. With normal RNNs, there is the issue of vanishing or exploding gradients, and of managing gradients.
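The arithmetic behind that multiplication problem is easy to check. This is a minimal sketch, assuming the gradient is scaled by one constant factor per time step (a real RNN multiplies by a Jacobian that changes at every step; the factors 0.9 and 1.1 and the 100 steps are illustrative only).

```python
def backpropagated_scale(factor, steps):
    """Size of a unit gradient after being multiplied by `factor` at every time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

vanishing = backpropagated_scale(0.9, 100)  # factor slightly below 1
exploding = backpropagated_scale(1.1, 100)  # factor slightly above 1

print(f"0.9 over 100 steps: {vanishing:.3e}")
print(f"1.1 over 100 steps: {exploding:.3e}")
```

Factors only ten percent away from one already shrink or grow the signal by several orders of magnitude over 100 steps, which is why long dependencies are hard for a plain RNN.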

That's where there's a lot of work done around LSTMs and GRUs. LSTMs convert that multiplication problem into an additive problem. They also manage state, so now we can push how far back we want to look to model dependencies in a signal. LSTMs and GRUs are pretty similar; GRUs make more assumptions. In that sense, a GRU is less expressive, but for our practical purposes, it works just as well as an LSTM and it is a lot less computationally intensive.
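One way to see the cost difference is to count trainable parameters. The sketch below assumes the standard textbook formulations: an LSTM layer has four weight blocks (input, forget, and output gates plus the candidate state) and a GRU has three (update and reset gates plus the candidate). Specific libraries may add extra bias terms, so the exact numbers are illustrative.

```python
def gate_block_params(input_size, hidden_size):
    """Parameters in one block: input weights, recurrent weights, and a bias vector."""
    return hidden_size * (input_size + hidden_size) + hidden_size

def lstm_params(input_size, hidden_size):
    return 4 * gate_block_params(input_size, hidden_size)

def gru_params(input_size, hidden_size):
    return 3 * gate_block_params(input_size, hidden_size)

# A univariate signal (input_size=1) with a hypothetical state size of 64.
lstm = lstm_params(1, 64)
gru = gru_params(1, 64)
print(lstm, gru, gru / lstm)
```

Under these assumptions a GRU layer carries three quarters of the parameters of an LSTM layer of the same size, with a corresponding reduction in per-step compute.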

A lot of the work in sequence-to-sequence modeling has been around NLP, and little of it has been applied back to time series. Typical problems are trying to predict the next word, trying to make translations, trying to do speech recognition. Sequence-to-sequence methods work brilliantly here because we're trying to model things like grammar and context, and they are perfect for those natural language processing use cases.

But if you look at a time series, we can start breaking that time series into sequences as well. Here is an example where you can just have a moving window that's moving through your time series, and we can start breaking our set of samples into sequences. There are some more hyperparameters that we have to think about: how big is the sequence that you're trying to model? What are the dependencies? Some of that domain knowledge has to come in, but it is possible to borrow from the state of the art that's being advanced in natural language processing and apply some of it to time series data and see what's there to be gained and what's there to be found.
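The moving window can be sketched in a few lines. The function name and its parameters (`input_len`, `horizon`, `stride`) are hypothetical, chosen only to make the hyperparameters mentioned above explicit.

```python
def make_windows(series, input_len, horizon=1, stride=1):
    """Slide a window over `series` and return (input sequence, target sequence) pairs."""
    pairs = []
    last_start = len(series) - input_len - horizon
    for start in range(0, last_start + 1, stride):
        split = start + input_len
        pairs.append((series[start:split], series[split:split + horizon]))
    return pairs

series = list(range(10))
pairs = make_windows(series, input_len=4, horizon=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each pair plays the role that a source sentence and its translation play in NLP: the input window is what the encoder reads, and the target window is what the decoder is asked to produce.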

Talking about time series modeling, a sequence-to-sequence network works pretty well out of the box when we are modeling time series data. We can have more complex models by adding attention or creating autoencoders for the data, etc., but for the purpose of today, I just want to stick to a simple model because I want to touch upon some of the infrastructure pieces as well. I like the sequence-to-sequence model for two reasons. One is that the encoder is trying to encode your time series and extract some of the things that are important in that signal, some of the things that can be identifiable and unique, and it brings a representation of the time series into a different state space. The other is that the decoder automatically builds in dependencies: every time we're making predictions, we take previous predictions as input, so it becomes a generative model, which helps with forecasting.
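The decoder's habit of feeding its own predictions back in can be shown independently of any network. In this sketch, `predict_next` is a hypothetical stand-in for a trained one-step model, and `linear_trend` is a toy replacement used only so the loop runs; neither comes from the talk.

```python
def forecast(history, predict_next, steps, window):
    """Roll a one-step predictor forward, feeding each prediction back as input."""
    context = list(history[-window:])
    predictions = []
    for _ in range(steps):
        y = predict_next(context)
        predictions.append(y)
        # Drop the oldest value and append the prediction, as a decoder would.
        context = context[1:] + [y]
    return predictions

def linear_trend(context):
    """Toy stand-in for a trained model: continue the last observed slope."""
    return context[-1] + (context[-1] - context[-2])

predictions = forecast([1.0, 2.0, 3.0], linear_trend, steps=3, window=2)
print(predictions)
```

Because every later step depends on earlier predictions rather than on observed data, errors can accumulate over long horizons, which is one reason forecast quality is usually checked against held-out ground truth.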

Generally, at this point, I would just open my laptop, kick off TensorFlow, PyTorch, or whatever, and start building my model, but we want to build this at the enterprise level. How many of you have seen this work or have seen this diagram? It makes a key point: in machine learning, the code piece is actually pretty small, and there's a whole ecosystem around it that needs to support your modeling: the data ingest, the data reliability, the model performance and reliability. All of these other components are much bigger if we want reliability in the application of machine learning, and if we want to apply it to real-world problems at scale. That's where I want to segue into some of the tooling pieces through Jupyter Notebooks, and then the interaction Jupyter Notebooks can have with HPC-style clusters to start building these training pipelines.

Jupyter Notebooks

How many of you are familiar with Jupyter Notebooks? Yes, it's really taken off in the data science community. It's an amazing tool for prototyping, but notebooks can be a lot more than just prototyping. For those who have not seen notebooks, a notebook is basically a mixture of markup and code. Here is an example of a notebook. It's broken down into cells: the first cell is a markup cell that talks about what the notebook is trying to do, then we have a couple of code cells, and we can also embed visualizations within that notebook.

The amazing thing about notebooks is that they isolate compute from code and text, so I can have a shared notebook that has code and text, but when I associate compute through a kernel, that's what we're seeing on the right side, which can be a Python kernel, an R kernel, or a Scala kernel, whatever that kernel is, we can start running that notebook and getting results by submitting code to that kernel.
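That separation is visible in the notebook file itself, which is a JSON document. The sketch below builds a minimal notebook in the version 4 notebook format using only the standard library; the cell contents are made up for illustration. Note that the file records just the name of a kernel, not the compute behind it.

```python
import json

notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {
        # Only a kernel *name* is stored; where that kernel runs is decided elsewhere.
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"}
    },
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Interpolate a time series"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello from a code cell')"],
        },
    ],
}

text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Since the document carries only text, code, and a kernel name, the same notebook can be attached to a local process or to a remote kernel without being edited.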

At PayPal, we've adopted notebooks, and we've added something called PPExtensions, a set of extensions to improve the user experience and reduce time to market for our developers. As part of that, we have a bunch of magics. You can think of a magic as a wrapper, or a syntactic sugar API. For example, we have a magic for Teradata: a percent sign is generally the signifier for a magic in the Jupyter world, so with %teradata I can start querying Teradata. The other thing we've done with notebooks is GitHub sharing and scheduling for notebooks themselves. This is open source; you can do pip install ppextensions and start using all of these features. I'll go into some of the magics that we'll be using today.

Before that, I want to talk about the typical flow a data scientist would have. There's one box missing here, which is data discovery, but once we know the problem we're trying to solve, there is this whole bit about data exploration, and then we have a bunch of steps that preprocess the data in different ways so that we make it suitable for our use case. This can include scaling and removing seasonality. It can be labeling the data, correcting the data, removing bad data, whatever these preprocessing steps are. Then, finally, when we come to the modeling piece, we pick models to train, and they can be any models. Then we have a whole series of loops with hyperparameter tuning to make sure that the model reflects the data that we're using, and that we're getting the performance we want.

In that, generally, if we look at the world of notebooks, there was a study done last year by a design lab at UC San Diego where they collected a bunch of notebooks, a million notebooks, and they found that most notebooks have either one word of text, so they're being used as scrapbooks, or they have so much text that they're too descriptive to be reproducible. Then we started thinking, "All of these steps can be broken down into reusable pieces." Just going back to software engineering principles, we can start associating each of these steps with notebooks, and we call these template notebooks.

If there's a data processing step that can be reused, we make it a template notebook. It's a template notebook in the sense that we can provide parameters that make it general enough to be used in several use cases and put it into a pipeline to build these kinds of tools. The two magics that help us do this are the run magic and the run pipeline magic. The run magic enables us to run these notebook templates; it will help us run notebooks in parallel and multiple notebooks in sequence. The run pipeline magic does the same thing, but it associates state with a pipeline.

Let's go back to that same problem of trying to build a time series forecasting model. For that, we look at building a pipeline with a couple of notebooks. Here, we have the run pipeline magic, and we've broken down that pipeline of forecasting time series data. The first thing I'm doing here is downloading data from TSDB. TSDB is our time series database where we store all our time series data. Today, I'm trying to model memory data from a single host. To that notebook, I can provide a bunch of parameters, and here the parameters are: what's the time series I'm looking for? It's the memory time series. What is the time period I'm trying to get the data for? Then, there are a bunch of preprocessing steps that follow.

These preprocessing steps include interpolating missing timestamps, where data was not available or was lost, splitting our data into training and test segments, removing seasonality, and scaling the data so it's appropriate for the machine learning model. Then finally, building that sequence-to-sequence model or the RNN model at the end.
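The four preprocessing steps can be sketched as small, reusable functions. Everything here is an assumption made for illustration: linear interpolation for gaps, a chronological split, seasonal differencing as the way to remove seasonality, and min-max scaling fitted on the training segment only. The talk does not say which methods the template notebooks actually use.

```python
def interpolate_missing(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def train_test_split(values, train_fraction=0.8):
    """Chronological split: the test segment is always the most recent data."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]

def remove_seasonality(values, period):
    """Seasonal differencing: subtract the value observed one period earlier."""
    return [values[i] - values[i - period] for i in range(period, len(values))]

def scale(values, lo, hi):
    """Min-max scaling with bounds taken from the training segment."""
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

# Synthetic signal with a period of 4 and two missing samples.
raw = [float(i + 5 * (i % 4)) for i in range(40)]
raw[3] = raw[10] = None

filled = interpolate_missing(raw)
train, test = train_test_split(filled)
train_d, test_d = remove_seasonality(train, 4), remove_seasonality(test, 4)
lo, hi = min(train_d), max(train_d)
train_s, test_s = scale(train_d, lo, hi), scale(test_d, lo, hi)
print(train_s[:6], test_s)
```

Fitting the scaling bounds on the training segment and reusing them for the test segment keeps information from the future out of training, which matters when the same steps are replayed at inference time.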

If I was a data scientist, I would probably want to focus only on that last piece, and all of these other things can come from other folks in the team or the organization. For example, I might not know exactly how to access data from a certain store, but if that's available to me as a template notebook like so, I can just provide some parameters and get the data. The other thing that's key to note here is that all of these parameters are metadata, so given the same parameters, the whole pipeline has to be reproducible, so when somebody executes a pipeline, we can cache these parameters, and because of the statefulness of the pipeline, it can be reused later.
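One way such caching could work is to derive a key from the notebook names, their parameters, and their order. This is a minimal sketch under that assumption; the step names and parameters are hypothetical, and it is not the PPExtensions implementation.

```python
import hashlib
import json

def step_keys(steps):
    """Return one cache key per step; each key covers that step and all steps before it."""
    digest = hashlib.sha256()
    keys = []
    for name, params in steps:
        # sort_keys makes the key independent of parameter ordering.
        digest.update(json.dumps([name, params], sort_keys=True).encode("utf-8"))
        keys.append(digest.hexdigest())
    return keys

shared = [
    ("download", {"metric": "memory", "days": 30}),
    ("interpolate", {"method": "linear"}),
]
train_keys = step_keys(shared + [("train_seq2seq", {"state_size": 64})])
infer_keys = step_keys(shared + [("infer", {})])

print(train_keys[:2] == infer_keys[:2])  # shared prefix produces identical keys
print(train_keys[2] == infer_keys[2])    # final steps differ
```

Because each key covers the whole prefix of the pipeline, two pipelines that begin with the same notebooks and parameters map to the same cached state for those steps, and only the steps that differ need to run.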

You might be wondering how these notebooks are chained together? How is data being moved between the notebooks? There is a dictionary called pipeline workspace that's available to all notebooks in this pipeline through which data moves. It can be data, it can be methods, it can be objects, it can be anything, and we'll see some of that in the demo later. After training the model, if you want to save the notebook and the model, we can do that as well.
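The shared dictionary can be pictured as follows. In this sketch each function stands in for one template notebook, and `run_pipeline`, `download`, and `scale_step` are hypothetical names; the point is only that every step reads from and writes to the same workspace, and that the workspace can hold callables as well as data.

```python
def run_pipeline(steps, workspace=None):
    """Run steps in order, passing the same workspace dictionary to each one."""
    workspace = {} if workspace is None else workspace
    for step, params in steps:
        step(workspace, **params)
    return workspace

def download(workspace, metric, points):
    """Stand-in for a data download notebook; writes synthetic data."""
    workspace["metric"] = metric
    workspace["raw"] = [float(i) for i in range(points)]

def scale_step(workspace, factor):
    """Stand-in for a scaling notebook; shares both data and a method."""
    workspace["scaled"] = [v * factor for v in workspace["raw"]]
    workspace["unscale"] = lambda v: v / factor

ws = run_pipeline([
    (download, {"metric": "memory", "points": 4}),
    (scale_step, {"factor": 0.5}),
])
print(ws["scaled"], ws["unscale"](ws["scaled"][2]))
```

A later step, such as an inference notebook, could call `ws["unscale"]` to map predictions back to the original units without knowing how the scaling was done.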

Then we have the run magic, which allows us to run notebooks in parallel. There are a couple of reasons why we might want to use this. One is when we're doing hyperparameter tuning: we want that same model notebook to run under different parameter configurations, so we can use the run magic to kick these off in parallel. The other is when we want to do some kind of distributed learning for TensorFlow: we can associate the same notebook to act as a parameter server or a worker. In this particular case, when we're looking at the use of GPUs, we can associate particular CUDA devices with each of these notebooks.

That's good and bad, if I were working on my own GPU host, then I can kind of associate these individual GPUs and do this, but at the enterprise level, we want the GPU farm or whatever to be a resource that is used and shared by everyone. In the typical notebooks world, today if I would just open and start a notebook server, it would start up a notebook server, and every time I start a new notebook, it's going to kick off a new process, which is a kernel, which is running locally. At the enterprise level, we have another piece called JupyterHub, which enables authentication of users to get into the shared environment.

Once we have GPUs in that shared environment, GPU compute is not elastic or shareable, so if one person is using a GPU, then unlike a CPU, that GPU cannot be shared. If it was a CPU, then we could have hundreds of notebooks running even if there are just eight cores. The OS does that time sharing for us automatically, out of the box.

The important thing here to note is that notebooks and kernels are two different things. Kernels can be remote, that compute can be remote, it does not have to be local to us in that node, and that's where we moved the kernels to a Kubernetes cluster, so then the compute is not associated with a node, but it's available to us in a cluster and if that cluster has GPU resources, we can consume GPU resources. If we want high memory, we can do that, and people can start playing around with different resources.

In the new view, users come into JupyterHub and get authenticated. At the enterprise level, this can be LDAP, two-factor authentication, whatever that method might be. That kicks off a notebook server, so each user gets their own notebook server, but what's different now is that the notebook server communicates with a Kubernetes cluster to get compute.

There are two open source projects that support this, NB2KG and Enterprise Gateway, both Jupyter projects. The Enterprise Gateway can act as a load balancer or a kernel lifecycle manager. Every time a user comes in and starts a new notebook, they'll get routed into one of these pods, which is the Enterprise Gateway, and every time we start a notebook, it's going to ask the cluster for a pod that is running a particular kernel, which can be Python 3 or any of these things, but the important thing is that at this point, everything is kernels. Every time somebody closes a notebook, the Enterprise Gateway is going to take care of that life cycle: killing that kernel, getting back the resource and freeing it up, etc. We know that containers are stateless, so we can attach a persistent volume to each of these Docker containers so users have their data and their notebooks.

Moving to this new world gave us two things. One is that compute is now elastic. People can request CPUs, people can request GPUs, and because users are getting authenticated at the enterprise level, we can have policies under which users have access to certain kinds of GPU compute, because everybody wants to train on a V100, but that's an expensive option, and we can have cheaper GPUs depending on what's available and what's not. The other thing, which is big, is that once everything is a container, your whole environment is packaged in that container, so if you want a specific flavor of TensorFlow, or a specific version of CUDA that your code depends on, that customization is possible, and every time a user opens a notebook, they'll have their custom flavor ready for them to use and consume.


With that, I want to switch into the demo to show how it looks. Here is the notebooks interface we would see, and when we try to create a new notebook, we get a bunch of kernel options. Some of these are custom, large, small, TensorFlow, TensorFlow with GPU. If I start a kernel with GPU, it connects to the Kubernetes cluster, requests a GPU, and attaches one GPU to this notebook. Automatically, a GPU resource, which was constrained before, can now be available to users on request.

The other bit is the fact that there's user authentication, so this is at the enterprise level. When I came into the notebooks world, it identified me as Aashish Sheshadri, and it is also persistent: I have my notebooks, my data, whatever, associated with me even though it's a Docker container in that Kubernetes world. The other bit is customization. Here we have a particular version of TensorFlow associated with this kernel, it's 1.8.1, and we have some version of CUDA associated. For my use case, if I want a different version, I can spawn my custom container. While that's loading up... Now if I do the same thing, I have a different version of TensorFlow, and I'll have a different version of CUDA, so this kind of user-level, role-based personalization is possible, and the same personalization can be applied to compute as well.

The next bit I wanted to show is what a template notebook looks like, so users can share the same kind of workflows. Here is a notebook; it's called interpolate the time series. It asks: what interpolation method do you want to use? What's the frame name? How do you want to save it? What is the resolution of the timestamps, and do you want to do it in place or not? These are the options that we can give users, whether they want to consume this notebook as is or play with these parameters before they put it in the pipeline.

Coming back to training that time series model, here is that pipeline, same thing: I'm downloading data here from our time series database, and I'm interpolating it, splitting it, scaling it, and then I'm building my sequence-to-sequence model on top of that with certain parameters. Before that, I load PPExtensions, which provides the magics, and then, when I run this pipeline, it's going to start running it in sequence. First, it downloads the data. For now, it's doing it locally, but it can also be within Spark or something similar, where it's shared compute, the data is gathered there, we do some processing, and it comes to us batch-wise.

While this is running, I'll show some of the other pipelines that we can build on top of it. Here, it's the same pipeline, but the last notebook is going to do inference instead of actually training the model. In this notebook, based on that forecast inference, we can detect outliers, which is what we use it for here. It's going to train a Gaussian mixture model to show what the data was looking like in the past and what the likelihood of the data is in the future. Let me touch on one more aspect and then we can come back to this one.

Here, with that same notebook, if I want to play around with parameters, say I want to change the batch size or the state size, I can say, "Ok, these are the different batch sizes I want to use, these are the different state sizes I want to use." Then, I can kick off an execution of the same notebook under different parameter configurations in parallel, and that starts running and saving the notebooks depending on the use case.
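Outside of the notebook magics, the same idea can be sketched with the standard library. Here `run_notebook` is a hypothetical stand-in for executing one parameterized notebook, and the loss it returns is a placeholder formula rather than a real training result; the grid values are also illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_notebook(params):
    """Hypothetical stand-in for one notebook run; returns a placeholder loss."""
    loss = 1.0 / (params["batch_size"] * params["state_size"])
    return {"params": params, "loss": loss}

grid = {"batch_size": [32, 64, 128], "state_size": [16, 32]}

# Expand the grid into one configuration per combination of values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# Run all configurations concurrently and collect results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_notebook, configs))

best = min(results, key=lambda r: r["loss"])
print(len(configs), best["params"])
```

Each configuration is independent of the others, so the runs can be spread over whatever compute is available, and the saved results make it possible to compare them afterwards.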

Now it's finished downloading the data, then it's doing the other steps, and finally it's building the model at the end. The nice thing about this shows up if you have a second pipeline that does essentially the same steps. When we're doing inference, we want the data to be preprocessed in the same way as it was at training time, and the only notebook that's different is the inference notebook. Ideally, I can reuse a lot of the work that was already done. Here the only notebook that's running is that inference notebook at the very end, because it can reuse a lot of the work that was done as part of the training cycle.

Going back to the slides, let's look at what forecasting looks like with that sequence-to-sequence model. I pulled up the results because training the model can take time. The model was trained over a month of historical data. The length of the input sequence was one hour in the past, and we're looking for a next-minute forecast. This is the forecast for the next seven days for just memory usage from a single host, so it's just one of the signals. It tracks it pretty well. There's a zoomed-in version where we see the forecast and the real data: the ground truth is in red and the forecast is in green. What's the point of the forecast, and what can we do with it? Again, here we have the 7-day forecast, that's the next week's forecast for that memory time series. If I take the previous week, build a mixture model, get the likelihood of the data, and then build that log likelihood for the next week, I can do it for both: I can do it for the actual data, and I can do it for the forecasted data.

What that's going to give me is this: here is a point that is probably an outlier. Green is the likelihood from the forecasted data, red is the likelihood on the real data. Clearly, there's a discrepancy here, which means that point is an outlier, so we can have more statistical methods that look at just likelihood ratios instead of having hand-tuned static thresholds and things like that. Here is another small bit. You see that U-shaped dip, which the model had support for from the past data? The likelihoods for the forecast and the real data match, and we don't see that kind of outlier, so when we're looking at outliers with a forecast, we can do a lot more.
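The comparison of likelihoods can be sketched as below. The mixture parameters are assumed to have been fitted on the previous week (fitting is omitted), and the component values, the sample data, and the `max_gap` threshold are all made up for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def mixture_log_likelihood(x, components):
    """components: list of (weight, mean, std) tuples with weights summing to 1."""
    density = sum(w * gaussian_pdf(x, m, s) for w, m, s in components)
    return math.log(max(density, 1e-300))  # guard against log(0)

def flag_outliers(actual, forecast, components, max_gap):
    """Flag points where actual and forecast log-likelihoods disagree strongly."""
    flags = []
    for a, f in zip(actual, forecast):
        gap = abs(mixture_log_likelihood(a, components)
                  - mixture_log_likelihood(f, components))
        flags.append(gap > max_gap)
    return flags

# Assumed mixture for memory usage (percent), e.g. a normal mode and a busy mode.
components = [(0.7, 40.0, 2.0), (0.3, 55.0, 3.0)]
forecast = [40.5, 41.0, 54.0, 40.0]
actual = [40.0, 41.5, 55.0, 70.0]

flags = flag_outliers(actual, forecast, components, max_gap=5.0)
print(flags)
```

The third point sits in the second mode for both series, so it is not flagged even though its value is high; only the last point, where the real data lands somewhere the forecast and the historical mixture do not support, stands out.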

That was just one signal that I looked at, and I said we have 350,000 signals, so there are still 349,999 signals left to be monitored. The point I really wanted to make, and the reason I called the talk "A Journey", is that we are building some of these infrastructure pieces that help us build reusable and reproducible pipelines, so with these reusable components, we can start prototyping different kinds of methods. We can start building hierarchical pipelines, which take into account forecasting, and a model that builds on the forecasting to detect outliers. A model on the outliers, starting at the team level, can focus on service-specific problems where we can draw actionable insights, do something, expand a resource, or whatever. The problem space starts growing and becomes closer to what we actually want to solve, and these can become building blocks.


In summary, I want to say that notebooks can be more than just documents. They can help us orchestrate things like machine learning pipelines at the enterprise level, they can be windows into compute, and they can be windows into orchestrating different kinds of data sharing. Sharing notebooks can help us make these reusable pieces. We can apply recurrent neural nets to time series data, and they can help ease alerting and monitoring flows. It's been helping us a little bit, and we're still on that journey to see how to scale it out, because a lot of these models are pretty expensive.

The nice thing about using these deep methods is that we can auto-encode a lot of the time series. Even though we have 349,000 or so left, a lot of these have similarities, and through embedding, that space becomes much smaller. In that smaller space, having to build separate models is easier. Also, as the forecast I showed suggests, these signals are fairly predictable; they're not that random. Some of these features are modelable. We looked at a week-long forecast, so we can build a model a day, or a model a week; it is still possible to attack the problem with some of these methods.

Questions and Answers

Participant 1: We know that a notebook is actually saved in JSON format. The question is how do you maintain the code through changes? For example, how do you do code review of all these things?

Sheshadri: It is in JSON format, but there has been open source work around tracking just the code cells, so we can commit notebooks to GitHub, and when we pull notebooks, it actually looks for merge conflicts and for code changes, so we can track notebooks as well. Even though it's JSON, there are modules that help us do this.

Participant 2: What if [inaudible 00:36:02]?

Sheshadri: We have versioning and a lot of these template notebooks are more like notebooks that have been reviewed and are available as part of a repository, so depending on the version, your pipeline won't be disturbed. Even if the notebook gets updated, it will be associated with the new version and we can change pipelines like that.

Participant 3: How do you then build caching between the steps in your pipeline? If we resubmit, do we download the data again? How does the caching of all the processing work? Could you reuse it the next day, or is there an expiry date?

Sheshadri: These are small pieces we were building. The only data that was being passed between notebooks is that pipeline workspace, a dictionary, which is the state. This state can be serialized and persisted in object stores. The reason we are able to identify it uniquely is that all of the parameters are associated with the notebook, and the notebook's name and the sequence can be hashed to a unique value. In that way, it can be searched for and reused today or tomorrow, or whatever that timeframe can be.

Here, since it was happening in parallel, some of these completed, and some of the ones that were more complicated would have taken more time. When we are talking about data moving between the notebooks, it does not have to be actual data. It can be methods, it can be iterators or pointers to data within a cluster. Here we have a notebook that's setting up the Spark environment, and from that Spark environment, it's going to create an iterator for some data, and the notebook that consumes the data can consume it through the iterator. That final notebook can be something as simple as just calling next on the iterator. I don't care how the data moved into the cluster; I just keep calling next, I get my batch of data, I train my model, and then I keep getting batches. The same thing can be true for inference as well.
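Passing an iterator instead of the data itself can be sketched with a plain generator. The `batch_iterator` function and the in-memory `range` source are hypothetical stand-ins for whatever a cluster-backed notebook would actually place in the workspace.

```python
def batch_iterator(source, batch_size):
    """Yield lists of up to `batch_size` records from any iterable source."""
    batch = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Producer step: put the iterator, not the data, into the shared workspace.
workspace = {"batches": batch_iterator(range(10), batch_size=4)}

# Consumer step: pull batches without knowing where the records come from.
first = next(workspace["batches"])
remaining = list(workspace["batches"])
print(first, remaining)
```

The consumer only depends on the iterator protocol, so the producer can change how and where the data is read without the training or inference step being touched.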

Participant 4: When you have all your components of your pipelines and notebooks like this, how do you test them?

Sheshadri: I can persist each notebook to look at errors and so on. For example, let me pull up a pipeline so we can see that. Here, there's a colon next to that notebook. In a normal execution, that notebook would not have been saved, but if I want to test a single notebook in that pipeline, I can add a colon and have that notebook persisted, so I can know exactly what's going on at that level. There are testing frameworks for Jupyter Notebooks in general that do automatic testing of a notebook; for a notebook pipeline, this is something we're thinking about and also building out.



Recorded at:

Jul 03, 2019