Zero to Production in Five Months @ ThirdLove

Summary

Megan Cartwright discusses how ThirdLove built their first machine learning recommendation algorithm that predicts bra size and style. She talks about the challenge of working with real-world data where there is no truth flag, and about the tradeoffs associated with key decisions they made around design, implementation and testing.

Bio

Megan Cartwright is Head of Data Science at ThirdLove. She entered data science via physics where she developed algorithms to predict energy transfer across the solar system.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

We are a small startup based here, in San Fran. And my team develops algorithms and infrastructure across all functions of the company. So we're talking about marketing, CX, product, finance, everyone. And I'm really excited today to share, actually, one of my favorite algorithms, probably because it was the first one we developed when we had nothing in product. And then we moved to machine learning in our product relatively quickly, within four and a half months.

So as alluded to, we make bras. And first, I'm going to describe the problem space to you a bit, because I suspect there's a number of you who don't really understand why we're doing this or what the problem is. And then I'm going to go into some of the real-world issues that we were faced with as we developed this algorithm and this infrastructure.

The Problem: Why Bras?

So why bras? Well, as you suspect, half the population wears a bra, starting at age 12 up to literally their last day. So that's a really long lifetime value. It's also a multibillion-dollar industry worldwide, and it's seen very little innovation in the last 60 years. So for example, bra sizes themselves haven't changed in over 100 years, and they were developed by a man. So that just makes no sense to me. I won't talk about all the issues of bras. But just so you know, to get a feel for it, one of the issues women face is their strap slips down. And so if I was up here in a dress with no sleeves and my strap slips down, I'm going to be self-conscious, annoyed, distracted, as might be you guys. And that's just a really small issue. This could literally happen to you every single day, many times a day. It's really annoying. There are all kinds of other issues when it comes to fitting bras. And so at ThirdLove, what we've aimed to do is try to fix some of these raw, core issues that have seen no innovation.

In the industry, what happens is you basically have two choices. One is legacy companies like Victoria's Secret or Maidenform. Another is fast fashion, such as like Zara or H&M, or whatever. So these are the two options that we've had for quite some time. Now, if you go to Victoria's Secret, that use case is you go into a room, some stranger touches you while you're naked and, hopefully, gives you some right size and gives you a bra. Then, you might wear that bra for the next 20 years, realize that you've been wearing the wrong bra for 20 years. Not fun. And then your other option is fast fashion, which is really bad quality, trendy, get it now or never, you don't really know what you're getting there. So at ThirdLove, we decided to make a third option, which is to actually make a bra that has good quality at a good price point, and at all sizes.

So what you see here is that most women, maybe more than 40% of the U.S. population of women, don't actually fit in that traditional bra sizing anymore. At ThirdLove, we decided to change that. My CEO, she likes to say, "If shoes get half sizes, why don't bra cups get half sizes? It makes no sense." So we're literally trying to change everything we can about this product to make it fit women better. As you can see, Victoria's Secret, half as many bras as us. We're actually launching more bra sizes next year. So we have the full core set, and then we try to do this extended set. We don't call it "extended" because we firmly believe every woman deserves a good-fitting bra. So that's just sizing.

Moving Bra Sizing from the Store to Online

So where does data come into play? So we come into play in this specific algorithm, in that we are moving bra sizing from the store, that horrible interaction that women have to go through, to online. And we do believe we can do this online. So what we do right now is, this is actually one of our landing pages, we call it the "Fit Finder Quiz." And in under a minute, you answer a series of questions. And using those questions, we recommend the right size and style of bra for that person. Millions of women have actually taken this quiz of ours in the last year, and over 70% say they have a very strong size issue. So this is a big problem that we really need to support.

In our team, our goal is to fit every woman to the right size bra at least the first time. I mean, ideally, we do it the first time. Obviously, you want to do it every time. But usually, people come in, and they do the Fit Finder one time, they get recommended a size and style. So we wanted to change this. So this Fit Finder Quiz right now, it's a series of questions. This is one of the first questions, bubble question, where you put in your current size. And there are 10 or 12 questions, and these questions are all coded up in a JavaScript React app, and it's all rules-based. So it's like, "if, else, if, else." It was created by our head of creative, who has been designing bras for 30 years and knows what it takes for a bra to fit the right body.

So using that content knowledge, we populated this "if, else" logic, which I call the "rules-based algorithm," though it's not really an algorithm. And we ask a series of questions. So here's another question. Here, we ask what brand you're wearing, because different brands have, actually, different styles attached to them. So here, you can write in "Target," for example. Also, how well does something fit? Is it a little? Is it a lot? And then, using all of these questions, plus more that I'm not showing, we actually recommend to you your size, like 38A, and your style. In this case, it's a Plunge. There are actually, like, three main core bras, and Plunge is one of them. Plunge fits women who are particularly narrow-shouldered or short. And so this is actually one of our core bras that we would offer.
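A minimal sketch of what an "if, else" quiz of this kind can look like. The real quiz was written in JavaScript and its rules are not given in the talk, so the cup list, the `cup_fit` answer values, and the adjustment rules below are all hypothetical; only the Plunge recommendation for narrow shoulders comes from the talk.

```python
from dataclasses import dataclass

# Hypothetical subset of cup sizes, including half cups.
CUPS = ["A", "A½", "B", "B½", "C", "C½", "D"]


@dataclass
class QuizAnswers:
    current_size: str       # e.g. "36B": two-digit band followed by cup
    cup_fit: str            # assumed values: "fine", "gapping", "overflowing"
    narrow_shoulders: bool


def recommend_by_rules(answers):
    """Return (size, style) from hand-written rules, with no learning involved."""
    band = int(answers.current_size[:2])
    cup_index = CUPS.index(answers.current_size[2:])

    # Hypothetical rule: move half a cup down or up depending on reported fit.
    if answers.cup_fit == "gapping":
        cup_index = max(cup_index - 1, 0)
    elif answers.cup_fit == "overflowing":
        cup_index = min(cup_index + 1, len(CUPS) - 1)

    # From the talk: Plunge suits narrow-shouldered women.
    style = "Plunge" if answers.narrow_shoulders else "T-Shirt"
    return f"{band}{CUPS[cup_index]}", style
```

Every new question adds another branch, which is part of why a rules-based quiz like this is hard to improve from feedback.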

The Real World is Full of Messy Data

The company knew that a rules-based quiz wasn't going to last very long. They knew that they were collecting all this rich data about how women were fitting and feeling in their bra. And they knew that if they could get a better feedback loop using a machine-learning algorithm, we could actually, over time, fit women better. So that's where my team came in. They asked us to solve this problem. So this is where I think we're going to all be much more interested. But it's really, "Where is the real world in this?" And so the rest of the talk is going to be about the real-world consequences of using this data and how you build this infrastructure.

Right now, the first thing we did was actually look at the data. And it's really messy, as you could suspect. One is those bubbles – that's structured data. So, yes, I get your bra size. You're a 38B. Awesome. But then, if you notice, there was some other data that was more freeform text, which is a lot harder to parse through and understand. There are a lot of ways to do it. And so obviously, you want to pick a way and hope that's the best. But you can't spend forever, because we're a small startup and so you only have a few days to figure this out.

Some of the other things that you have to worry about are, "Where is the data stored? Are you tracking it all?" What we found was some of the data was stored in Parse in our back end. Our orders data was in Shopify, and so you had to link that together. You also have exchanges and returns. And how does this data all play together? Is it in the right format? Are there duplicates? Because usually there are. All this comes together, and you have to put it in a place to actually create what we call a "training dataset," where you actually build a model off of it.
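As a rough illustration of the linking and de-duplication step, here is a sketch that joins quiz sessions to orders on a customer key. The field names (`session_id`, `customer_id`, `order_id`, `size`) are assumptions for the example and are not the actual Parse or Shopify schemas.

```python
def build_training_rows(sessions, orders):
    """Join de-duplicated quiz sessions to orders by customer.

    sessions: list of dicts exported from the quiz back end (hypothetical fields)
    orders:   list of dicts exported from the store (hypothetical fields)
    """
    # De-duplicate sessions: a later record with the same id replaces an earlier one.
    unique_sessions = {}
    for session in sessions:
        unique_sessions[session["session_id"]] = session

    # Index orders by customer so the join is a dictionary lookup.
    orders_by_customer = {}
    for order in orders:
        orders_by_customer.setdefault(order["customer_id"], []).append(order)

    rows = []
    for session in unique_sessions.values():
        for order in orders_by_customer.get(session["customer_id"], []):
            rows.append({
                **session,
                "order_id": order["order_id"],
                "ordered_size": order["size"],
            })
    return rows
```

Sessions with no matching order are dropped here; whether to keep them depends on what the model is meant to predict.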

So like I said, the first thing we did was get all the data and put it in one place. So we put it in Redshift, and we're like, "Okay, what are we going to do? What is it telling us?" And so one of the first things I asked the engineering team was, "Are we actually tracking all the data?" There's clicks, submits, there's all kinds of stuff. What we're interested in is sessions. So, “How many sessions did someone do? Did we track all that?” “No. We aren't tracking all that. We tracked the very last time they took a Fit Finder Quiz.” That's not helpful. If we're going to build an algorithm, we actually need to know all the data. And that was just one example.

Real World Algorithm Design: How Do We Define Success?

Let's get into it. In terms of building an algorithm like this in the real world, how do you design it, and how do you measure success? What you see on the right-hand side, is actually the Fit Finder would recommend a size and a style, and you're like, "Awesome." Then, you look at the next. And then we parse all that data together that we saw before and we're like, "Did they keep it? Did they exchange it? What was their final kept size and style?" And then to the right is all the features, some of the features we put into the model that we're playing with.

So the really important part here is the feedback loop. So what is it that you want to make better? You want this fit to be better, right? But how are you going to get there? Is it conversion? "Oh, they loved this bra. They converted. We're done." But what feedback do you get from that? Nothing. Is it returns? Do you include returns? Because they didn't like it, obviously, they returned it. But why did they return it? We don't actually know. So you should track that, too. And then also, exchanges. We have exchanges. So if you look at all of this data and put it together, you can actually … What we decided to do as our feedback loop was actually, what did they finally keep? You can actually exchange a bra with us many times because we're dedicated to getting you into the right fit and to the right style. So this is actually really complicated to track and to pull and to push together. But this is what we decided to do as our final feedback loop, was the kept.
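One way to picture the "final kept" label: replay a customer's shipments and returns in order and see what they are left holding. This is a sketch under assumptions. The event format and action names are made up for the example, and an exchange is modelled as a return followed by a new shipment.

```python
def final_kept(events):
    """Return the (size, style) the customer ends up keeping, or None.

    events: chronological list of (action, size, style) tuples, where action
    is "shipped" or "returned" (hypothetical event names).
    """
    held = []
    for action, size, style in events:
        item = (size, style)
        if action == "shipped":
            held.append(item)
        elif action == "returned" and item in held:
            held.remove(item)
    # If several items are kept, this sketch takes the most recently shipped one.
    return held[-1] if held else None
```

A customer who returns everything yields `None`, which gives no kept size or style to train on.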

So now, we dug into it. Well, we have all this data. We have a feature set that we know we're going to track to, which is kept size and style. Now, what does this data show us? So we started tracking all the sessions. Awesome. What we found out was that people were doing this Fit Finder Quiz 3, 5, 10, 15, 20 times before or after they make an order. So which session is the truth session? And there is no truth session. Right? Is it the first session? Is it the last session that they're most serious about? Is it the session from which they made an order? Is it the session in which they got the same size that they ordered? Because you could actually change your size when you're in the ordering process. I don't know.

The way that we decided to attack this was, for the first alg we built, we totally cheated. We used only people who had done a single session, so their first and last session were the same one. And we had enough data to be able to build a simple alg off that. And then we decided to think about all the different hypotheses that we had brainstormed. Is it the first session that they're most serious about? And somebody was like, "No, it can't be. Because that's when they're in the subway, and they're just looking at it. It's the last session before the order." I'm like, "Okay, fine." So we actually, literally, are testing training data associated with each session using experiments. That's how we got around this one.
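The competing hypotheses can be written as interchangeable selection strategies, so that one training set can be built per hypothesis and compared. This is a sketch only; the strategy names and the `time` field are assumptions.

```python
def pick_session(sessions, order_time, strategy):
    """Choose which quiz session counts as the training example for one customer.

    sessions:   list of dicts with a sortable "time" field (hypothetical)
    order_time: time the order was placed, comparable with session times
    strategy:   "single_only", "first", or "last_before_order"
    """
    ordered = sorted(sessions, key=lambda s: s["time"])

    if strategy == "single_only":
        # The "cheat": only use customers who took the quiz exactly once.
        return ordered[0] if len(ordered) == 1 else None
    if strategy == "first":
        return ordered[0] if ordered else None
    if strategy == "last_before_order":
        before = [s for s in ordered if s["time"] <= order_time]
        return before[-1] if before else None
    raise ValueError(f"unknown strategy: {strategy}")
```

Building one dataset per strategy and comparing the resulting models is one way to read "testing training data associated with each session using experiments."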

Real World Biases

And then we're like, "Great. We have a model. It's awesome. We have all this training data." And then we start looking at it, and we realize that there are some interesting trends here. And I think this is a really important piece that's really been weighing on me lately, is that your training data has innate biases. Amazon just recently came out saying they had to throw out a couple of their algorithms because there were innate biases that were overpredicting things that were not something that they wanted to see. And so for us, it's the same thing. And so here, one of the things that's interesting about bras is people become very connected to their bra size. "How dare you tell me I am not a 36C? That is my size." And you see this in the data, actually. We would see people doing the quiz multiple times trying to get back to the same size.

And so we actually did some simple clustering to find this pattern in the data. And so there's a subset of people, we correlated it back to what we heard from the return data. And really, what we heard was this; this like strong subset of people who want their same size no matter what we recommend to them. So that means, how do you deal with that in the data finally? Do you remove it? Which is what we did; we removed that subset because we actually do want to predict the best size for you. But we also are building a new product to recommend a couple different sizes to you. So then, if you want your original size and your recommended size, you could have both and then just return which one you don't want. That's how we approached it.
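The talk says this group was found with simple clustering and does not give details. As a stand-in, the sketch below uses a much cruder hand-written rule to flag the same behaviour: a customer who was recommended a different size, retook the quiz, and ended on the size they originally stated. The rule and the field names are assumptions, not the method used in the talk.

```python
def is_size_attached(sessions):
    """Flag a customer who retakes the quiz until it returns their stated size.

    sessions: that customer's quiz sessions in chronological order, each a dict
    with "current_size" and "recommended_size" (hypothetical fields).
    """
    if len(sessions) < 2:
        return False
    stated = sessions[0]["current_size"]
    first_rec = sessions[0]["recommended_size"]
    last_rec = sessions[-1]["recommended_size"]
    return first_rec != stated and last_rec == stated


def drop_size_attached(customers):
    """Keep only customers whose sessions do not match the pattern."""
    return {cid: s for cid, s in customers.items() if not is_size_attached(s)}
```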

Another interesting thing which a lot of e-commerce has to deal with is actually the "lazy returner syndrome" I like to call it. We knew it was going to be a problem. But we actually knew that people who said, "Oh, I want to return," and then never actually sent the product back to us, was a small but significant subset. So how do you deal with this in your dataset? And for us, we just decided to go back to our target metric, which is the feedback loop. And just keep iterating on that and making sure that that's a really strong connection, that we're getting as much feedback as we can from returns and from exchanges, so then hopefully we can identify these people. But it's a very hard problem to deal with.

And then, finally, the last bias I'll talk about is just returns. So people return it and they say, "Oh, you know, I don't know why I returned it. I just don't like it." You know? "So why don't you like it? Give us some information? Did it actually fit you, and you just don't like it? Or you thought it was too expensive?" So ways you have to get around this is really asking more questions in the Return form, which is painful to do. But hopefully, you can make it in a way that's easy to do. Also, you can look at those customers and try to relate them to other customers within your training database and then maybe remove them, which is another trick you can do. But we didn't do that.

Real World: Cold Start Problem

Great. So now, you have your training data. You have built up a simple model off of it, and you're putting it into production. And then the VP of operations comes to you and says, "Hey, guess what? We're launching 50 new sizes next month." Like, "Oh, awesome." We don't have those sizes in our training dataset. So this is called the "cold start problem." It's a well-known problem in machine learning. For us, we actually made an algorithm for all those green and yellow sizes. But those red sizes, we literally launched while we were testing the green and yellow sizes. So we actually decided to pull out the yellow sizes and use the old rules-based algorithm, because we didn't want to poorly recommend a yellow size into green, when it should have been a red size.
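The fallback described here amounts to routing each request: use the model only for sizes it was trained and tested on, and send everything else to the old rules. This is a sketch with hypothetical names, where `ml_recommend` and `rules_recommend` stand in for the two recommenders.

```python
def route_recommendation(current_size, model_sizes, ml_recommend, rules_recommend):
    """Return (source, recommendation) for one customer.

    model_sizes: set of sizes the model is trusted on (the "green" sizes in
    the talk's slide); anything else falls back to the rules-based quiz.
    """
    if current_size in model_sizes:
        return "ml", ml_recommend(current_size)
    return "rules", rules_recommend(current_size)
```

Returning the source alongside the recommendation makes it possible to log which path served each customer.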

And then we had to develop a strategy for how we want to deal with cold start, and we came up with two approaches. The best approach is always to get more data. And so, for us, the velocity at which we gain data wasn't a problem. So we could actually do that, and wait for the data to come in. There are other approaches where you actually use a more Bayesian stats approach, where you find priors, where the probability that someone has these features and will convert is similar to other people. Luckily, we didn't have to deal with that problem.
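The talk does not spell out the Bayesian alternative, and the team did not need it. One common reading is shrinkage: estimate a rate for a new size by blending its few observations with a prior taken from similar sizes. The sketch below is that generic technique, with `prior_strength` as an assumed tuning parameter, not anything ThirdLove describes.

```python
def shrunk_keep_rate(kept, shipped, prior_rate, prior_strength=50.0):
    """Estimate a keep rate for a size with little data.

    kept, shipped:  observed counts for the new size
    prior_rate:     keep rate borrowed from similar, established sizes
    prior_strength: how many pseudo-observations the prior is worth (assumed)
    """
    return (kept + prior_rate * prior_strength) / (shipped + prior_strength)
```

With no data the estimate equals the prior, and it moves toward the observed rate as shipments accumulate.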

And then there are other business considerations as well, when you build a real-world alg, especially for a small startup where maybe your inventory is closely monitored and watched. So here, on the right-hand side is our basic T-shirt demi-cup, and on the left is the box of the same exact bra in different colors or styles. So you have lace or, actually, that turquoise one has a different covering on it, so it's like stripe-y. And so what was interesting was even though we built our alg off this training dataset, what we found was people were ordering different things than they were being recommended. And so, that's great. But how do you deal with that? What if we don't have enough of the turquoise lace overlay bra in stock? We can't recommend that to our users. So we actually had to dial our alg back and put some guardrails in place to make sure that we were only recommending the bras that we knew we had enough inventory for. Otherwise, you're going to be in trouble. So that's guardrails, Part 1.
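An inventory guardrail can be as simple as walking the model's ranked styles and returning the first one with enough stock. This is a sketch; the stock threshold and the data shapes are assumptions.

```python
def apply_inventory_guardrail(ranked_styles, stock, min_units=1):
    """Return the best-ranked style that has at least min_units in stock.

    ranked_styles: styles ordered from most to least preferred by the model
    stock:         dict mapping style name to units available
    """
    for style in ranked_styles:
        if stock.get(style, 0) >= min_units:
            return style
    return None  # nothing recommendable; the caller decides what to show
```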

And then what model do you finally choose for production? So we have spent all this time thinking about which model to choose, how to build the model, the training set, been very thoughtful. So sizing of bras is really, like, retarded because those sizes are not necessarily linear. The band size is linear, but then there's something called "true cup size," which is linear; it's volumetrically linear, but then the size of the bra is not linear. That's weird. So that's why I say sizing is not well done.

But anyway, so you can build a model that's linear, and we did. When we were building and testing out the training data and building out the infrastructure to support it to the client side, we decided, "Oh, a linear approach, super-simple, easy." Always start simple and easy. And it was a very small file. So you take the model, and then you serialize it, and it was really small. But then, at the same time, we were also playing with random forests and ensemble methods, because we knew that we needed to get a really good accuracy as well. And the accuracy on the linear model was not so great. So we're still working on that.

Creating a Compatible Model API & Infrastructure

And then we decided to build out the infrastructure at the same time. And we're building infrastructure where our site is actually a Shopify site still, and we use Parse in the back end with a JavaScript React Fit Finder Quiz. And so we knew we wanted to build something very simple that worked with our stack, something with Python and scikit-learn and so it'd be easy to use. And then we decided to deploy, from a machine learning standpoint, the ensemble method, the random forest method, which is a much more complicated algorithm. And when you serialize it, which is that little weird "01" green section, it is too big to fit in that simple little infrastructure that we decided to go with. It wouldn't fit in DynamoDB. DynamoDB has a storage limit of 400 kilobytes. So that's fun.

Our random forest, when you pruned all those trees back, it was not performing very well. So we're like, "Well, we can't use DynamoDB. So what are we going to do?" Luckily, we have some great engineers and machine learning engineers, and we just decided, "Screw it. We're going to put a link to S3, and then we'll store the models in S3." And that way, we can have our scientists or data scientists be able to export the model into S3, and then use it in production and then pull it in.
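The size problem and the "link to S3" workaround can be sketched without any AWS calls: serialize the model, check it against the 400 KB limit, and store either the bytes or a pointer. Plain dicts stand in for the DynamoDB table and the S3 bucket here, and the key layout is made up; a real version would use an AWS client library instead.

```python
import pickle

DYNAMO_ITEM_LIMIT = 400 * 1024  # 400 KB limit mentioned in the talk, in bytes


def publish_model(model, name, table, blob_store):
    """Store a serialized model inline if small, else store a pointer to it.

    table and blob_store are plain dicts standing in for DynamoDB and S3.
    Returns the serialized size in bytes.
    """
    payload = pickle.dumps(model)
    if len(payload) <= DYNAMO_ITEM_LIMIT:
        table[name] = {"inline": payload}
    else:
        key = f"models/{name}.pkl"          # hypothetical key layout
        blob_store[key] = payload
        table[name] = {"s3_key": key}
    return len(payload)


def load_model(name, table, blob_store):
    """Load a model by name, following the pointer when there is one."""
    record = table[name]
    payload = record["inline"] if "inline" in record else blob_store[record["s3_key"]]
    return pickle.loads(payload)
```

Only unpickle bytes from a store you control, since loading a pickle can execute arbitrary code.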

Our Design Overview

So what's that look like? This is just a very sketchy overview. I would say it's a very classic overview, where models are uploaded by the Data Science team. We have a promotion scheme, so you have different classifiers, and then the API itself actually calls a predict function and fits the size and style, sends it back to the client side. And then we have some A/B testing stuff on the front end. And then, if you see, all that data is logged and sent back to Redshift, which is great.

But we decided, in our validation process before we launch anything live, that we're going to launch it in Shadow first. So the data scientists export the model in, we need to test it, and Shadow allows us to have all requests be evaluated using any one classifier and then save that data in Redshift. So something that was really interesting to us was, we'd done all this work, and we failed to recognize that our styles that we were recommending were actually significantly different than the styles the rules-based algorithm was recommending, and that's because that's not what people were buying. We had put guardrails around, "Oh, it can't be lace, and it can't be colors. It has to be core, '24/7' styles," we call them. But we had failed to realize that that could actually have inventory concerns. If we're not recommending anybody this one style, then there's going to be a lot more orders in this other style. But that's what they're ordering anyways. I guess, that was my philosophy, so I was like, "Deal with it. Order more." So anyway, that was just an interesting Shadow thing.
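Shadow mode in general means the customer is served by the live recommender while candidate classifiers score the same request on the side and only get logged. This is a minimal sketch of that pattern; the log format is an assumption, and a list stands in for the Redshift logging pipeline.

```python
def handle_request(answers, live, shadows, log):
    """Serve the live prediction and record shadow predictions for comparison.

    live:    (name, predict_fn) for the classifier customers actually see
    shadows: dict of name -> predict_fn evaluated silently on every request
    log:     list collecting records (stand-in for the real logging sink)
    """
    live_name, live_predict = live
    result = live_predict(answers)
    log.append({"classifier": live_name, "mode": "live", "prediction": result})

    for name, predict in shadows.items():
        try:
            log.append({"classifier": name, "mode": "shadow",
                        "prediction": predict(answers)})
        except Exception as exc:
            # A failing shadow model must never affect the customer's response.
            log.append({"classifier": name, "mode": "shadow", "error": repr(exc)})
    return result
```

Comparing live and shadow records is how a gap like the style mismatch described above can surface before launch.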

And then we decided to deploy it live. Which is always slightly terrifying to deploy an alg live, especially when it's the first alg, and you've never done it in product before. So we just tested a classic A/B test, our alg against rules, 50/50, and keep rate is our primary metric. So what we found was that we did, over time, we won. And at the same time, we developed infrastructure to be able to retrain that same model on a monthly cadence, where we're weighting newer feedback data more than the old feedback data. As you could imagine, we get better and better with the bra product and the recommendations, and whatnot.
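Two pieces of this paragraph lend themselves to a short sketch: comparing keep rates between the two arms, and weighting newer feedback more when retraining. The talk gives neither the statistical test nor the weighting scheme, so the two-proportion z-score and the exponential decay with a 90-day half-life below are assumptions chosen for illustration.

```python
import math


def two_proportion_z(kept_a, n_a, kept_b, n_b):
    """z-score for the difference in keep rate between arm A and arm B."""
    rate_a, rate_b = kept_a / n_a, kept_b / n_b
    pooled = (kept_a + kept_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (rate_a - rate_b) / std_err


def recency_weights(ages_in_days, half_life_days=90.0):
    """Weight each feedback row so that it halves every half_life_days."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]
```

Weights like these can be passed as `sample_weight` to the `fit` method of scikit-learn estimators that support it, such as `RandomForestClassifier`.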

Where Are We Going?

So that was our timeline from building this JavaScript app which collects data, building the machine learning models and infrastructure, and then deploying them live to the front end. So where are we going now? Well, we're all about real women and real women's problems in this space. So we are developing algorithms across the product, the site, the experience, returns, exchanges, next orders, what we should be recommending, what kind of imagery we should be showing and how we show it across different markets, and things like that. And then, inventory predictions as well.

So if you are interested in machine learning and machine learning infrastructure, we are hiring as well, for machine learning engineers. Thank you.

Questions and Answers

Participant 1: I had a question about the architecture diagram that you had. When a model is promoted to production, what's the mechanism by which it's switched in your API?

Cartwright: Yes. We have a couple schemes, actually. Well, we're testing multiple classifiers, because we're actually building different models to predict the same thing. So we don't actually have a strict promotion scheme about which one we switch to. What's it called? We just take one out when we decide not to use it anymore. And actually, the way it pulls up a classifier, we're kind of cheating, today. And I assume you won't be cheating. But in our A/B test split, because we're a Shopify site and we actually don't own the load or anything, we have to call the classifier up by using the A/B test flag that we use.

So we use VWO currently. We're going to actually move off to something else, so we can do much more sophisticated testing. Because right now, we don't have any capability to do multi-armed bandit or whatever, which is what we need to be doing. So, in the A/B test flag that we build, it has the name of the classifier, and that's how it actually loads in. It's cheating. But we're a startup, so I can do that.
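The flag-based lookup can be pictured as a small registry keyed by classifier name, with the A/B test flag carrying that name. This is a sketch; the registry contents and the fallback to a default are assumptions, not details from the talk.

```python
def classifier_for_flag(flag_value, registry, default="rules"):
    """Resolve the classifier named in an A/B test flag.

    registry: dict mapping classifier name to a predict function
    Unknown or missing flag values fall back to the default classifier.
    """
    name = flag_value if flag_value in registry else default
    return name, registry[name]
```

Retiring a classifier then means removing it from the registry, which matches "we just take one out when we decide not to use it anymore."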

Participant 2: So for your models, do you do any online training based on the new data that you get in? Or do you have to retrain your models every few weeks or whatever and deploy them?

Cartwright: Yes, that's a great question. And right now, we have to manually retrain them. I mean, we retrain them on a schedule. So we pull the data, retrain the model, and then we have to push the model. But it's still somewhat manual. We're actually changing it right now, so we have a better infrastructure. This is a cheap and easy way to get into the product quickly. But now, the team is like, "No. We want to do a continuous deployment. We want to do like a proper blah, blah." And so we're moving in that direction though, so we can do more real-time. It would be nice to have, especially, on some of the new styles and some of the new testing we're going to be doing, to be able to do more real-time work.

Participant 3: I was wondering how much you think you suffer from survey fatigue? Right now, this year, I've kind of given up on most feedback requests. And the only ones I really do are the ones where I think I'm going to benefit. And even if it's for the benefit of the company to develop stuff, then I see how it might benefit me. So are you trying to measure that? Are you trying to get a handle on that?

Cartwright: Yes, actually. So we only require feedback when you do an exchange or return, in the product to get the return slip, and we're rethinking that design even so. Because still, you make an order, and then we send like a [inaudible 00:29:10] review. And that's actually one way where I'm trying to understand biases in lazy returners, I call it the "happiness factor." And so I actually just got enough data to be able to do a little bit of analysis around how happy people are, and then trying to identify who those people are early using an algorithm. That's what we're doing now. We try to send like 1 in 10, but it has to be random enough across geography and all that, and it's kind of hard to know sometimes. But I think that's a huge problem.

Participant 3: And do you think your people who are keeping the bras, then a lot of people don't return things because the buyers don't return things? So how are you trying to navigate that one?

Cartwright: Yes. That's a really hard one. I would say it's the hardest one we have. We actually tried making an algorithm to identify these people via actions they were taking, and it wasn't very great. I can actually build a model to predict if you're going to convert on our site or not, way better than this, which has even less data. So I'm not sure how we're going to address it entirely yet, other than getting some of these reviews and trying to identify them. I think in time, we'll get better. But it's a hard problem.

Participant 4: So it seems like quite a few of the data points that you guys ingest are free text fields. And so for those fields, in particular, what kind of standardization logic do you have in place and deduping, and how much have you seen that might affect your algorithms as they evolve?

Cartwright: So interestingly enough, there's only really one, the brand question where you can put "other" in, that really is a good feature for us, is a really good predictor. At first, we manually went through it, to be honest, and then parsed it that way. Actually, we decided that, "Well, that's a lot of effort and it would require …" There are a lot of different ways you can do this, like text parsing automatically, but none of them are really capturing what we needed to capture.

So at the end of the day, we had net phrases like, "Okay, 'VS,' Victoria's Secret" – even though Victoria's Secret is an option, so I don't know why it would – that we used, and then we grouped them. And then we actually grouped them into, "Are they Victoria's Secret-like?" so they were looking for sexy, lacy stuff, or, “Are they like more every day? Or are they more fast fashion?" And so that actually came out pretty clear in the data. But, yes, there's no easy way to do that, I would say, especially if you want to do this quickly, like real-time. Awesome.
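The grouping described in this answer can be sketched as a two-step lookup: normalize the free text to a canonical brand, then map the brand to one of the three groups. The alias table and the brand-to-group assignments below are illustrative guesses (only the fast-fashion examples and the "VS" abbreviation come from the talk), and the real lists were built by hand.

```python
import re

# Hypothetical alias table mapping common spellings to a canonical brand.
BRAND_ALIASES = {
    "vs": "victoria's secret",
    "victorias secret": "victoria's secret",
}

# Hypothetical assignment of brands to the three groups named in the talk.
BRAND_GROUPS = {
    "victoria's secret": "victoria's secret-like",
    "target": "everyday",
    "h&m": "fast fashion",
    "zara": "fast fashion",
}


def brand_group(raw):
    """Map a free-text brand answer to a coarse group, or "unknown"."""
    cleaned = raw.lower().replace("’", "'")
    cleaned = re.sub(r"[^a-z&' ]", "", cleaned)      # drop digits and punctuation
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse whitespace
    canonical = BRAND_ALIASES.get(cleaned, cleaned)
    return BRAND_GROUPS.get(canonical, "unknown")
```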

Participant 5: How many data points did you actually need before your machine learning algorithm gave you good results?

Cartwright: You know, honestly, I don't know how many the minimum threshold was. I started at 90k observations, which is a very small subset, but it was the cleanest subset I could pull. We're actually using it in product still. So not that much data, actually.

Recorded at:

Jan 15, 2019

Megan Cartwright
