InfoQ Homepage Presentations Back to Basics: Scalable, Portable ML in Pure SQL

Back to Basics: Scalable, Portable ML in Pure SQL

View Presentation

Speed:

40:58

Summary

Evan Miller walks through the architecture of Eppo's portable, performant, privacy-preserving, multi-warehouse regression engine, and discusses the challenges with implementation.

Bio

Evan Miller was an Operational Excellence Engineer at Amazon.com. Most recently, he has the role of a Statistics Engineer, where he helps Eppo build out a world-class, warehouse-native experimentation platform.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Miller: I am a Statistics Engineer at Eppo, which is basically half data scientist, half engineer. I'm implementing statistical things frequently for our experimentation platform. You'll be hearing a bit about our platform and what it does, but that's not really the focus here. This is the ML Applications track. This is going to be about half of the application of why we want to predict things in an experimentation context, which might not be immediately obvious. Then, half engineering and really focusing on the algorithms and implementation. That's what you can look forward to.

What Is an Experiment?

To motivate this, I'm going to talk about the number one problem in experimentation. Before I get to that, I just want to step back a little bit, and just really describe what an experiment process looks like at a company, or even in the academic or scientific world. The first thing that happens when an experiment runs, just at a very high level is you have a lot of users, and you want to randomize them into at least two groups. Typically, you have a treatment and a control. In this little diagram, I've got users coming in with different IDs, and then I need to assign them randomly to one of two treatments. In this case, this might be like a marketing promotion. I'll show half of users, save 75%. Then I might show the other half this yellow sale button over there. The overall pattern, or the overall flow of an experiment looks something like this. You start out with a planning phase. This is going to be one of the focuses of the talk. You need to determine the number of users who will go into the experiment before you even start the experiment. This is something that's often overlooked, and can lead to some trouble if you don't pay attention to it. The second phase in the experiment is the randomization, that black box that I showed you in the last slide. You need some algorithm to randomly assign treatment, control, or to multiple treatments. Then, you need to wait and collect data on your users. You're going to be measuring something. You're going to need some fixed time period over which to collect data. It could be a revenue metric, just could be any user level metric that's going to be of interest to the business. Then, finally, after you've waited, after you've collected the number of samples that you determined during the planning phase, you'll analyze that data. You'll compute metrics, run statistics. Get some p values. Get some confidence intervals, and hopefully make some business decision to decide whether or not your treatment is a good idea for your business.

What Is Power?

I want to talk a little bit about statistical power. This is something that people who go through those four steps might not necessarily understand. Sometimes people have a hard time wrapping their minds around statistical power even after taking a stat 101 class. There are basically two types of lifts that we need to think about. A lift here is just, what's the percent increase in a metric of interest? We have what's called the true lift, which is, if we were to gather an infinite amount of data, observe an infinite number of users, what would the observed lift actually be? It could be 10.12489%. We don't have an infinite amount of data. We only have a finite user population. Instead, all we have to work with is the observed lift, which is not going to be the same as the true one. Power deals with the relationship between that true lift and the observed lift. It's defined in relation to something called a minimum detectable effect. The idea is, if the true lift is equal to some number known as this minimum detectable effect, what percent of the time will an experiment give us a statistically significant result? It's just a concept I find myself repeating quite a bit in the experimentation world, because if you don't have enough power for your experiment, your results will look something like this on the right, which is, you have wide confidence intervals. All your results are statistically significant and you don't really understand why. The reason is you just didn't have enough users enrolled in your experiment to see anything useful in the data. Power is just a recurring theme in the world of experimentation. It makes PMs sad when they don't get stat sig results.

The Power Problem

How do you compute power? There are sample size calculators out there. This is one. It's just a little JavaScript calculator. It has the core concepts here. If you have some process with a conversion rate that you know in advance, let's say a checkout rate of 2%, you would define a minimum detectable effect in relation to that. You say, I want to be able to detect a relative change of 10%, some percent of the time, and that percent of the time is known as the power. How many users would I need to enroll in this experiment? With those two numbers, you want to be able to detect a lift from 2% to 2.2%. You need about 78,000 users in each treatment of your experiment, which might sound fine and great. Ten percent is a pretty large lift to be detecting for most product feature changes. You start to really run into trouble when you dial that down and want to detect smaller effects, which you would want to do if you're just trying to test every single change that you might make to your website or your product. Just to show you a different set of numbers, and this is where people's minds start exploding, is if you want to detect a 1% change instead of a 10% change, you need about seven-and-a-half million users enrolled in each branch of that experiment. This is just going from that 2% number to a 2.02% number. If you're Facebook or another very large company, 7 million is nothing to you. You've got all the users that you need for very precise experiments. If you're everybody else, that number can look pretty intimidating. A recurring theme in experimentation is knowing when not to run an experiment because you just don't have the users for it.

Solution: CUPED

Is there anything we can do to reduce this number of users that we need for a successful experiment? It turns out that there is. I'm going to be talking to you about CUPED, which is a paper from 2013. This was a method invented by some researchers at Microsoft: Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. It essentially reduces the number of users that you need in an experiment, I say with magic, really just math. It's a technique called variance reduction. I'm going to walk through the mechanics of how CUPED works, and then show how we implemented it at Eppo. The core idea here is, we're going to be looking at how the users' behavior compares to their predicted behavior, and take the difference between those two things in order to get a measurement with less variance than before.

How does it work? This is the same illustration as before, but instead of just assigning each user to treatment or control, we're going to attach some additional data at that time of assignment. At the assignment time, the user has not been affected at all by the experiment, and so we're free to supplement that variant information with other information that we have about the user and construct a dataset with that. In this case, I'm going to attach previous revenue figures from those specific users. I might know that user_id 12 had $120 worth of revenue in the past month, and 43 had $30, and 67 had $328. I'm going to use that information in order to run a more precise experiment. I'm going to show you how that works. This is just a very simple data table that we can construct with the user_id, the variant, this predicted revenue. In this case, I have the world's simplest predictive model, which is that their last 30 days is going to be equal to their next 30 days. We're going to get into more advanced models, but this is just for purposes of illustration. Then, instead of analyzing the actual revenue that we get from each user, which is that fourth column, we're going to be analyzing the last column, which is the difference between the actual observed revenue and the prediction. These are just made-up numbers. The key insight here is that the variance in that last column is significantly lower than the variance in the next to last column. This is just the core idea behind CUPED is that we just have much tighter numbers, just looking at those residuals compared to prediction rather than just the observed values.

Viewed another way, this is just an illustration I put together for one of our blog posts about this. Without CUPED, you might have two different estimates of some metric of interest. This will be the ones on the left. You might have a control and treatment where the uncertainty is relatively wide for both of those numbers. Then once you apply CUPED, because you have less variance in your data, you have much tighter estimates on what the mean is for each group. With that, your p values are lower, your confidence intervals are tighter, and everybody goes home happy. Just another illustrative diagram here with CUPED enabled. This just means that, whereas your sample size calculator may have told you you needed 120,000 users before, if you flip this on and you provide some of this pre-experiment data and reduce the variance, typical results here are, you could get that down to like 50,000 or 80,000 users. Sometimes it's more than that. Sometimes it's less. This is just a rough ballpark of typical results that we've seen, and that other organizations have reported. None of this is new. This is all stuff originating from that original Microsoft paper pretty widely used in industry.

How do we implement this? It all sounds great. We get experiments that require fewer users than they did before. Let's go to work. This was the first version of our CUPED architecture. You read it from bottom to top here. In our system, customers have all their data sitting in a data warehouse. We have Snowflake, BigQuery, Redshift, and Databricks represented. The customers will have just event level data there. We crunch a lot of numbers, compute subject level data out of that for each experiment. Then we're going to want to run some computation on that subject level data. I've labeled these big data at the bottom, which is just event level data. Medium data in the middle row just refers to subject level data. Then, finally, once this Python process is done, which is using these technologies, SciPy, NumPy, and pandas, we get tiny data, which is just the actual confidence intervals for the experiment results. That will literally just be five rows going into our Postgres table at the end.

What's this middle row doing? It's a pretty standard data science-y architecture. We got a Python process that runs. It connects directly to Snowflake. It downloads Big table. It pivots it in memory in a way that pandas is going to be happy with, that we can run linear regressions on. Then, finally, we run those regressions using NumPy. This is nothing too special here. Pretty easy for data scientists to write and get it stood up and go to work. This is an actual screenshot. You can just see how simple that fit and predict method ends up being, just using Python, using pandas. It's all at the bottom. We get an estimator, we fit it, and we make some predictions using that estimator. We return those predictions, and then do all the stats on that array of predicted values, or the difference between the prediction and the residual depending on which exact version of CUPED that you're doing. It's nice, clean code.

We rolled it out. It worked well. I just took this example at random. We've got a little checkbox on each experiment result page. You can turn CUPED on or off. This one's more or less taken at random. The confidence intervals are just slightly smaller, but sometimes that can get a statistically insignificant result into a significant one. That makes people happy when they have those tighter confidence intervals. One thing worth noting here, it's not really important, but the point estimates from CUPED results are different from non-CUPED results. That just makes people's heads hurts sometimes. They're both consistent estimators of the same thing. The actual measured difference that you see, or like the observed lift is not the true lift in the first place. Likewise, with CUPED applied, the predicted lift or the estimated lift is also not equal to the true lift, but it's a consistent estimator of that. Just some little wrinkle that you might have to describe to CUPED users. That was all well and good. People were happy, until the scary ellipses. You start seeing things like this, which most developers have seen at some point in their life, in some way, shape, or form. In our case, it was the node was low on resource memory. Some container was trying to use a ridiculous amount of memory which exceeded the request size, and so it's evicted, and the Kubernetes pod is killed without ceremony.

This is an actual Slack message I got from one of my coworkers one morning. A CUPED process was running on one of our larger customers, he said, woah, pasted in this graphic of memory consumption, and what caused big spikey. This became a recurring pain point is memory. In-memory analytics all sounds well and good, but there ends up being a lot of operational issues in terms of provisioning machines with the amount of memory that you need. Knowing in advance how much memory you will need before you provision that machine, paying for it for like keeping data in-memory, the second point is about 500 times more expensive than keeping it on disk. Then, finally, getting data from your warehouse into memory can take a long time. If you look at this graph, this is actually just our Python process trying to download data from Snowflake for three hours and slowly growing its own memory usage. Then, finally, at the end of three, three-and-a half hours, it has everything in-memory, and then tries to pivot it. That's where this big spikey comes in, and kills the process. We just started to hit some pain. We did an investigation into our Python process and our CUPED jobs, it's just a screenshot from this internal investigation that we did, and about half the jobs failed. Some would take 13 hours and succeed, others would take 3 hours and run out of memory. A lot of this time, in the last column you see there, was just from fetching and pivoting data, really just downloading data from the warehouse, and then trying to get it into a form that pandas could use. This was my life for a few weeks, just trying to debug these issues. Maybe it looks familiar to some of you out there.

Python Alternatives

How do we get around this? We took the Python process to the limit of what we could do with that architecture. Then we started casting about just for some other ideas to get around the memory issues in particular. Two options that we considered are described here. The first is use a warehouse specific machine learning solution. Redshift has SageMaker, BigQuery has BigQuery ML. I'm not sure if Snowflake has anything yet. Then Databricks is hooked up to Spark. For our use case, we needed to support all four warehouses, some of our customers are on each warehouse. There would have been costs associated with developing against a specific warehouse, whereas the rest of our code base might have only 100 or 200 lines of warehouse specific code. The advantage of this approach is that you don't have to move the data out of the warehouse, or at least from the customers' perspective, it doesn't look like you're moving it out. There's still physical data movement happening from BigQuery's table system to their ML system, which does have time and costs associated with it. From the customer perspective, they just think it's all sitting there. The drawback, there are those extra costs associated with moving that around and using these fancy ML systems. There are also extra permissions required for some of this. We got an awkward situation where we're trying to ask customers to grant us more permissions on their warehouse account and trying to explain why we needed it compared to just the read only BigQuery account that we had. There are some real drawbacks to that approach. Another option would have been using Apache Spark, or Databricks, and just moving all that data out to a scalable system that knows how to do linear regression, and it's solved this problem of trying to run regressions on large datasets. That would have been nice because it would work across warehouses. The downside, of course, is that we have to move data out. We do have a lot of customers who are in constrained regulatory environments or are sensitive about PII leaving their systems. They weren't too keen on us downloading data, either into a Python process, or into a Spark process.

Back to Basics: Linear Regression

We considered those. Then we thought about, what if there's another way here? This is a quote, I first came across this, it's like in Essays of Francis Bacon. There's a story in the apocryphal, but Mohammed is with a number of his followers, and at some point commands this mountain to move to him, and of course, the mountain doesn't go anywhere. He smiles and says, "If the mountain won't come to Mohammed, then Mohammed must go to the mountain." I like this because, if the data won't come to your computation, maybe you should bring your computation to the data. That will be the theme of the rest of this talk. What are we actually trying to do? Our specific predictive model was just a linear regression. It wasn't the super simple one that I showed you. Essentially, we are trying to predict the value of each metric using all metrics for a specific user. The basic linear regression equation, if you remember from a stat class or econometrics class, I'm trying to solve Xβ equals y. You're solving for β, y are your covariants or your inputs, these dependent variables, so these would be previous values of metrics. We'd have one row per user, one column per metric value, and then we want to predict each user's value for a single metric, and that's going to be our y column. Then β is just our best fit coefficient vector. Nothing too crazy here that we're doing.

The solution to this equation is this very famous equation, the estimated β, x prime x inverse, x prime y. I really want to just look at this equation and figure out if we really needed Python or anything too fancy to compute this thing. Our core insight was this, this x prime x matrix is a k by k matrix. If k is the number of columns or matrices, then the result of this product is not a gigantic matrix with a million rows or a million columns. This is a very small matrix. If we have 10 matrix that we're interested in, the x prime x matrix will be a 10 by 10 matrix. It turns out each cell is very simple to compute in SQL. If we have a constant term, then the top left term will just be a count of the number of rows. We all know how to do that in SQL. This top row and first column here, again, if there's a constant term, it's just going to be the sum of each column. Then each one in the middle is a sum of an x times another x. This is very easily expressed as a SELECT. It turns out these warehouses are highly optimized for these kinds of problems of multiplication and addition. Similarly, the other half of that equation, x prime y is also very simple. The first term is just going to be the sum of y's and then the other terms will be sums of various x's times y's.

What does that mean we can do? If we can compute those two terms in SQL, then we can solve the rest of the equation in the control plane or at the application level. This is just a little screenshot from our new code base. What this is doing is after we've computed the x prime x and x prime y using SQL, then we can just invert it, which is the second line of code, multiply the x prime x inverse by x prime y, which is the third line of code, and then get a varianceCovariance matrix out at the end with the fourth line of code. You don't need Python for this. Any library or any language that has a matrix library, and a set of special functions, if you want to get p values and things, is sufficient. We ended up being able to do this using TypeScript for our control plane.

What's the new architecture look like? It's something like this. We've got that big data still sitting in the warehouse, but now we just do a lot of extra compute in that warehouse. We do all those sums of x times y's in the warehouse, and then only instead of downloading the subject level data to our systems, we just download that x prime x matrix and the x prime y matrix to our system, do the matrix inversion, the multiplication, get the varianceCovariance matrix. I've labeled small data, since it's just those k by k matrices, and then get the same tiny data or results out the other end, which is just the actual experiment results being written to our production database. How did that work in practice? We saw some pretty big speedups. I just picked a couple of specific experiments that had been around before our architecture change, and after. We had one go from 9 hours down to I think 45 minutes, another one go from 9 minutes down to under a minute. This really speaks to just how long it takes to download data out of a warehouse, and also, just how well these warehouses do with sums, and multiplications. We have some fairly complex models. I think the most complicated is a 300 by 300 matrix, but BigQuery just tears through them. It does all the parallelization, the column level and row level for us. They're really good at what they do. We're able to leverage that in our new system.

OLS (Ordinary Least Squares) in SQL Architecture - Benefits

What are the benefits of this architecture? This system works across warehouses pretty well. It works on all four warehouses. Our warehouse specific code is just 100 or 200 lines of code per warehouse, so if we want to add another one, an engineer could do it in a few days. We use existing application language and infrastructure. This has been a really big gain for us. Prior to this change, we had a little Python universe and TypeScript universe. It was just hard to move engineers between those two code bases. We had all kinds of duplication just trying to communicate our data model from the TypeScript land over into the Python land. It could be just simple things like describing, what are the experiments? What are the variants? What are the treatments? How do we get that over from our main application into this external system? Do we need a microservice around it? All those problems go away once we're on a unified language, and able just to reuse frameworks, both for statistical and regular application code.

We don't move data out of the warehouse. This is very good for customers who value that privacy. We can assure them that we never see user level data. It makes the SOC 2 people happy, and everybody a lot happier when we don't actually see their users. Then, as you saw on the graphs, you get much faster results out of this thing. The architecture is a lot simpler. We're just able to rip out a lot of code, which always makes me happy. Then, finally, we don't see a lot of out of memory errors anymore. Previously, we just kept doubling the amount of memory that we tried to provision for these Python nodes, starting from 20 gigabytes, 40 gigabytes, 80 gigabytes, I think we topped out at 160 gigabytes, before we moved on to this architecture. As we were running the math, we weren't making much money off of our contracts with customers, because it was all going into these high memory machines that are really not that cheap on these cloud systems. I think the biggest node we have is 2 gigabytes.

OLS in SQL Architecture - Pitfalls

It's not all sunshine and roses, just a few pitfalls that we ran into along the way. Some of these queries do get very large. I was actually able to hit the 1 megabyte SELECT query limit in Snowflake, just with the sheer number of sums that we're doing for some of these large models. Also found out that Redshift has a column limit of 1600 in their result sets. You got to break up some of these queries into multiple queries. It doesn't have a huge effect on throughput, because the second point is you can parallelize just by querying concurrently. Warehouses are very good at just taking multiple concurrent queries, and then doing their own parallelization within each query. It helps to have a language that lets you do concurrent querying, like a Promise.all, which we use in our Node.js. You can still hit OOMs. I think we've hit a few just because in JavaScript, it's not like this super dense matrix representation. They're like boxed primitives and arrays and things. Not a huge deal, but just something to be aware of, if it's not like just this tight array, or tight matrix of floating point numbers.

Then another little got you that I hit. The second last point is, the literal 1.0 doesn't mean the same thing across warehouses. There's no such thing as standard SQL. Some warehouses, that's a floating point number, other warehouses, that's a fixed precision decimal number. If you're not really paying attention to this stuff, you'll just get these weird overflows or errors where it's trying to fix precision in floating point precision arithmetic. Just make sure everything's float, and your life will be a lot easier. Then, the last little got you, and this is an advanced point. We tried implementing heteroskedasticity-robust standard errors, and that got a little bit trickier because we couldn't use the same trick of reusing this x prime x matrix. It would have ended up being a lot more computation than we signed up for. For plain OLS problems, we have the same equation structure, or the same specification on the right-hand side, you can reuse that x prime x, which will make everybody happy.

Future Directions

Future directions. Where can we take this thing? I spent time over Christmas just trying to get quantile regression working on these warehouses. There's a really neat paper from Roger Koenker about what he calls the Frisch-Newton method. It happens to map really well to just the warehouse architecture or SQL tables. I had fun doing it. It's an iterative method. It seemed to work pretty well. We haven't done it in production or anything, but it was a neat little proof of concepts that you can do other nonlinear regressions if you'd like. The way that works, instead of minimizing the squared errors, it'll minimize like an absolute value of the error. Ridge regression is just another variant of OLS. You just add like a k times i term across the matrices. Pretty easy to add. That's what we do in our production code. That works as well. I'm pretty sure a lot of different generalized linear models could work here. In my previous job, I implemented a lot of these by hand in a non-server context, like Probit, Logistic regression, Poisson regression, all your favorite GLM models. A [inaudible 00:33:18] could be done, but with one caveat, I'll point out on the next slide.

Yes, or your algorithm here. I'm pretty excited just to get the results that we were able to get just going back to those core equations and scratching our head a bit and seeing if we could implement these in warehouses. A lot of these algorithms are really just dot products, they are just sum of x times y, like large language models that fall in this category. Warehouses are just hyper-optimized for multiplying and adding. It's what they do well. Let them do what they're good at. The one caveat here is, for certain types of models, you might want a special function that's going to operate at the row level, like a lgamma function, digamma, kind of all these special functions you might see in a statistics textbook that might not be available natively in the SQL environment. For those things, you might need to implement that as a UDF. Warehouses have all kinds of UDF solutions, either with Python, or JavaScript, or just implementing a polynomial approximation yourself. That might be necessary if you're going to do something like a Probit regression.

Why Consider This Architecture?

Why would you consider this architecture? Portability. If you're just trying to do ML across warehouses, there's nothing more portable than SQL. I put worse is better in quotation marks here. This is a reference to a famous essay about Unix, arguing that Unix became popular in the '70s because it was worse than the alternatives. It was easier to implement because it had fewer features. If you can do things in SQL, you can do them anywhere. If your warehouse specific ML bills are getting out of control, this would also be another reason to consider this. I think there's just a lot of unnecessary overhead with those systems both moving the data over to them and the type of compute that's done on them. Data movement taking too long, if it's being too annoying, think about this. if your ML stack feels fragile, if you're hitting all those OOMs as frequently as we were, warehouses are very robust. There are certain types of operations where you start to hit their limits if you're trying to do 50 subselects or do all these percentile calculations. For sums and multiplications, you're not going to break the warehouse, but you might break your Python process. Then, finally, of course, you're just bored and want to try something new.

Questions and Answers

Participant 1: It seems like the crux of the solution is that the number of covariants is off or too large. Because the problem in SQL you were having was you were bringing in too much data, but if you think about things in terms of covariants that stay small. How long is that true for? How many covariants do you have to add before you have to start thinking about the approach?

Miller: Our architecture currently we deal with about tens of millions of subjects or rows, and up to maybe 300 covariants I think is our largest model. If you saw the specific limitations, there was the Snowflake 1 megabyte query limit kicks in. We hit that with the first 300. Then there's the Redshift 1600 column limit. If you think of a 300 by 300 matrix, it's going to be 90,000 entries in that matrix. You're going to have 70 separate queries just to get Redshift to process through all of those. I think if you start going much beyond that, you just have to break it up into more queries. It's an engineering problem. Those are the constraints around that. I'm not going to put a hard and fast rule. If your application environment does not have a good, dense matrix representation, then you're also going to hit trouble if you're like 10,000 by 10,000, or something, and trying to represent 100 million objects, like boxed numbers. You're not going to be in for a good time. I'm sure people could take it further than we have. That's the way that I would think about trying to scale the system up.

Participant 2: You mentioned you did some query functions. You're generating multiple variants to tackle this problem. The speedup that you showed which is the 9 hours to 45 minutes, was there parallelization on the query side?

Miller: There's a lot of parallelization on the query side.

Participant 2: Even though the time is shorter, because I think these systems like Snowflake, the query time has at least a cost for a single query that's going to be parallelized. How's the cost side of this?

Miller: It does move cost into the warehouse. What we've seen is, it's not the big cost center yet compared to computing something over the number of events and facts, and condensing that into subject level data. It's about 50/50 between the CUPED part of things and the everything else part of things. It depends on the warehouse billing model. For Snowflake in particular, time equals cost. They bill you for the number of minutes that your Snowflake cluster is running. That's a nice one-to-one cost model. Especially, since the compute grows like k-squared, as you add tons of covariants, that's going to add computation quadratically. That's something to be aware of.

Participant 2: This subject, you mentioned like hundreds of millions? This is after applying CUPED to reduce the number of subjects you would need to drop through?

Miller: It depends on whether the customer decides to use the variance reduction to reduce the number of subjects in the experiment, or just to get more precise estimates using the same number. We can't tell them to use fewer subjects, but it could take that number down 50% as well.

See more presentations with transcripts

Recorded at:

Oct 02, 2023

Evan Miller

InfoQ Software Architects' Newsletter