
Policing the Capital Markets with ML


Summary

Cliff Click talks about SCORE, a solution for doing Trade Surveillance using H2O, Machine Learning, and a whole lot of domain expertise and data munging.

Bio

Cliff Click was the CTO of Neurensic, and CTO and Co-Founder of h2o.ai. Cliff helped Azul Systems build an 864 core pure-Java mainframe, and worked on all aspects of that JVM. Before that he worked on HotSpot at Sun.

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

[Note: please be advised that this transcript contains strong language]

Transcript

Click: I'm Cliff Click, I'm currently at a different startup [inaudible 00:00:07] Neurensic. I did this some years ago; Neurensic exited and the technology was sold. I'm now currently doing an IoT thing. I've done a lot of stuff in my life, I've been coding since I was a small kid; I built my first compiler when I was 15, forever and a day ago. I did the core guts of a Java virtual machine, I did the GC, the compiler in it, and I did a bunch of cool custom hardware for Azul Systems. That involved doing low latency GCs as well as yet another port of the compiler, non-blocking algorithms, distributed stuff, OSs, device drivers, a lot of HPC stuff, all over the map.

About a decade ago I did a startup called H2O, and I'll talk a tiny bit about it here. They're still in business. It's a platform for manipulating big data and doing math on it, and it's faster than Spark. In my opinion it's somewhat easier to set up and somewhat harder to use, and that was maybe the business difficulty, but it remains 10 times faster than Spark, it's hugely faster, which on a terabyte worth of data means I can do ML super-fast.

Let's talk about what we're doing here. All of our economy is valued and priced by the capital markets. What you pay for your car, for your electricity, the gas, the food you eat, everything has a value determined by the capital markets. This stuff is bought and sold, and the entire volume of the world's economy is measured this way. Nearly all of the U.S. trading is done with automated trading algorithms; people say high frequency, and it's not low latency trading. There is a lot of low latency trading, don't get that wrong, but it's high frequency: there's a high volume of trades going on. The same thing goes for the futures and commodities markets. Commodities will be stuff like oil and corn, food that you eat and stuff that you burn through on a regular basis, whereas equities will be stuff like Microsoft stock, IBM stock, Apple stock.

Most of that stuff is surveilled, looked at for fraudulent behavior, in a fairly manual process, using '80s and '90s rule-based engines. It was shocking, when I got in here, what people were doing to look at trades: I literally saw people cut and paste from IBM 360 green screen emulators into Excel spreadsheets. That was what the compliance officer was doing when he was looking for fraudulent trades among a billion trades a day. He has no chance of catching this stuff.

Neurensic – Forensics in the Markets

What we did was build a tool for doing market forensics, for reading the stock ticker tape, basically, and looking for illegal activity. We're not law enforcement, we're tooling, so we built a tool and then we handed it out to everyone who was interested; we sold it to trading firms, but we gave it to all the regulators. The goal here was to help tighten up the fraudulent behavior that runs through the capital markets, and it turns out a shockingly high amount of it is going on.

Not just the New York Stock Exchange ticker tape, but tickers from all kinds of exchanges. The New York Stock Exchange is a big one, everyone knows it, you see it on the news every night. Most people also know NASDAQ as an exchange, and some people also know the commodities exchanges, but the commodities market actually has a higher valuation. It has an equal order-of-magnitude valuation to the New York Stock Exchange, because the shit you use every day you have to have every day. It's important to you to have it, whereas Apple stock you own or you don't, or you fire and forget on your mutual fund.

Maybe about a trillion rows a day are coming through across the whole market; a big firm might see a billion, and a small or mid-sized trading firm might still have tens of millions of rows in a day. They were using these very old school rules-based engines which would have a 95% false positive rate, and we decided to apply ML and see if we couldn't change the game.

We are not declaring what is illegal, we're a tool builder. Dodd-Frank, the law, says that the intent to deceive is illegal, so we're trying to build an ML to detect intent. What the hell does that mean? Actually deciding what intent is requires a federal judge; somebody is going to go to jail, or at least there's going to be a court case. What we would do is look at your behavior and compare it to other behaviors that were considered fraudulent, had been prosecuted or investigated, that people knew were shady. We would just give you a score from 200 to 800; the ML came out with zero to one as a ratio and you just scaled it. Once you got a pattern that was clearly risky, the next step was to go to a compliance officer who would double check the ML, and whatever happened after that depended on what the situation determined.
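
As a rough sketch of the scaling just described (the model emits a zero-to-one risk ratio that gets mapped onto the 200-to-800 display scale), something like the following would do; the function name and the clamping are illustrative assumptions, not Neurensic's actual code:

```python
def to_display_score(risk_ratio: float) -> int:
    """Map a model's 0.0-1.0 risk ratio onto the 200-800 scale from the talk.

    Hypothetical helper: clamp out-of-range model output, then scale linearly.
    """
    r = min(max(risk_ratio, 0.0), 1.0)
    return int(round(200 + 600 * r))

# A cluster the model scores at 0.33 comes out around 400, the kind of
# middling score a compliance officer might also assign by hand.
print(to_display_score(0.33))   # -> 398
```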

One of the major requirements here is, because we are looking at criminal behavior, there is a need for it to be transparent, utterly transparent. You have to be able to explain what it is, why you're making the decision, and what you believe to be illegal about it. Just finding it is the first step; you have to explain why what you're looking at is questionable behavior. Machine learning is typically very opaque: it might be very correct, it might give you a great answer, and it might also be incredibly difficult to tell why it gives you that answer.

We have to be able to justify our results to a federal judge, but we don't justify the ML results themselves. Instead, we show you what goes on. We show you what the trading firm knows, we show you the internal audit logs: what the trader's activity was, his attempts to trade, his positions, his buy-sell offers. We take in the public data, which would be the ticker that you see on the TV news, the bid-ask spread, the volume traded, cancels, historical trends and so on, and out of the billions and billions and billions of rows you have to filter and filter until you get something that's humanly legible, and people will hand-inspect this before anyone goes to jail.

Visualization of Raw Data is Key

You have to use the actual ticker data, not the machine learning results, because the actual data is hard legal evidence. If you're a trading firm and you are trading against an exchange, every UDP packet you send to that exchange, and vice versa, you must record for seven years; those are the audit logs. The data is this messy, horrible, roughly thousand-bytes-per-row kind of thing. There's a term in the industry called symbology: it is the mapping of text symbols to corporate entities. It changes over time, there are duplicates, there are dropouts, there are misspellings, there are all kinds of stupid things in there. The symbology is something everyone has to understand before you can show that you've got the right answer.
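
As a toy illustration of what a symbology lookup has to cope with (symbol variants, misspellings, and validity windows that change over time), consider a sketch like this; the table entries and function are hypothetical, not a real symbology feed:

```python
from datetime import date
from typing import Optional

# Hypothetical symbology table: each entry maps a raw ticker string and the
# date range it was valid for onto a canonical corporate entity. Real
# symbology feeds are vendor-specific and far messier than this.
SYMBOLOGY = [
    # (raw_symbol, valid_from, valid_to, canonical_entity)
    ("AAPL", date(1980, 12, 12), date.max,         "Apple Inc."),
    ("APPL", date(1980, 12, 12), date.max,         "Apple Inc."),    # common misspelling
    ("FB",   date(2012, 5, 18),  date(2022, 6, 8), "Meta Platforms"),
    ("META", date(2022, 6, 9),   date.max,         "Meta Platforms"),
]

def resolve_symbol(raw: str, trade_date: date) -> Optional[str]:
    """Return the canonical entity for a raw symbol on a given date, or None."""
    token = raw.strip().upper()
    for sym, start, end, entity in SYMBOLOGY:
        if sym == token and start <= trade_date <= end:
            return entity
    return None   # unresolved symbols get flagged for a human, not guessed at

print(resolve_symbol(" appl ", date(2016, 3, 1)))   # -> "Apple Inc."
```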

Then you have to visualize these patterns you're looking at. You end up showing the trades in real time, or tick by tick, or in slowed-down time, and you have to show the trader's positions, what he purchased on either side of the bid-ask spread, when he canceled and why he might have canceled. The book is the outstanding market bid-ask spread, and you get visualized displays of a number of things here that are very well understood by the trading community, but maybe not understood by lay people. I certainly didn't know what the hell was going on when I walked into this business; there was a big learning curve. But the key answer here is, your data provenance has to be basically perfect from raw data to visualized display, while you're doing the filtering with machine learning to discover the needle in the haystack.

We kept working on these displays for a long time as part of the corporate business. You had better and better visuals for existing patterns and for new patterns. As you could reliably label certain kinds of fraudulent behavior, the behavior would change: people would bail out and try something different, so we ended up adding new patterns, and some of the old patterns got kind of stale because no one was hitting them anymore. You had some GUI you had put together to show off a particular kind of trade that wasn't happening anymore.

The people we talked to had all been trading for 40 years, used to a thick-client desktop, but they were all ready to go to a browser. Browsers are everywhere, you don't have to do an install, you just have a webpage, and you can run the HTML through your firewalls. They all have very strict corporate security policies, and they all wanted mobile, so we could go to a browser on mobile. A compliance officer might look at something and vote yes or no, or give it a pass or whatever, but when he uncovers something that's ugly, he reports directly to the CEO, and is required by law to report to the CEO, because it's a firm-ending behavior if you don't act on it.

I know people don't understand this, but in a trading firm, they have money and the ability to trade, but they don't actually do the trades themselves per se. They bring a trader in who understands how and what he wants to trade. He takes their money, he does some trades, and at the end of the year they take their money back and split any profit he made. If he makes more money, then it's like, "Oh my God, this is great. Nudge, nudge, wink, wink, go make some more." That's the kind of thing going on, but the trading firm is responsible for the behavior of the trades that pass through their house.

If they know the trader is trading illegally and making money at it, because that's their goal, they can be charged for fraudulent behavior as well, even though they themselves are not deciding what to trade. In fact, that's what happened to Oystacher: his trading firm was declared no longer allowed to trade, it was basically put out of business, and that's a firm-ending behavior. You want to be able to hand this to CXOs and lawyers, and you want to be able to hand it to them on mobile.

The Data

What does the data look like? There are trillions of transactions, and only the tiniest fraction of these transactions ever get expert eyeballs on them to get labeled. I get a few new labels every day. I've got a couple of guys doing compliance, looking at trades; I task them specifically with things I wonder about, with how the model is doing, and there's some they're looking at for other reasons. You end up getting a dozen labels a day per compliance officer and that's about it. You also had external clients doing compliance who were looking at the results of your model and would label how well the model did, saying this was missed or that was good or bad or whatever, and you wanted to capture all of them because you had so few such events.

You had to go teach these guys how to capture results. They decided, "I look at this set and it's not really that bad, but it's not great. On a 200 to 800 scale, I give it a 400." You give it a number, that's your expert labeling. Make them write it down somehow and I get it into GitHub.

What does the data I'm working with look like? It's these '80s-era columnar text logs, it's all ASCII. There are a bunch of errors in here; commonly there's raw binary that gets mixed in because somebody had an error in their logging code and it puked guts into the log file. There are variable-length columns, there are missing columns, there are extra columns. There are special text codes for many columns. The exact format varies from vendor to vendor to vendor, and from trading house to trading house, so you have to do all kinds of custom tooling for each one of these things.
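
To give a flavor of the defensive parsing this forces, here is a small sketch; the column layout, field names, and junk checks are invented for illustration, since the real formats vary by vendor and trading house:

```python
# Defensive parsing sketch for messy ASCII audit logs. The column layout and
# junk checks are invented for illustration; real formats vary by vendor and
# by trading house.
EXPECTED_COLUMNS = ["timestamp", "trader", "symbol", "side", "qty", "price"]

def parse_audit_line(line: str):
    """Return a dict for a plausible row, or None for junk we have to drop."""
    # Drop rows where binary garbage leaked into the text log.
    if any(ord(c) < 9 or ord(c) > 127 for c in line):
        return None
    fields = [f.strip() for f in line.strip().split(",")]
    # Tolerate a few missing trailing columns; refuse rows with extras.
    if len(fields) < 4 or len(fields) > len(EXPECTED_COLUMNS):
        return None
    row = dict(zip(EXPECTED_COLUMNS, fields))
    row.setdefault("qty", "0")
    row.setdefault("price", "")
    return row

print(parse_audit_line("2016-03-01T09:30:00.123,TRADER_7,AAPL,B,100,101.25"))
print(parse_audit_line("\x00\x01\xffGARBAGE FROM A LOGGING BUG"))   # -> None
```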

When I get the tooling in and get things cleaned up, I'm looking at a cluster of trades, not an individual trade. An individual trade is not really fraudulent by itself, because the trading firm has validation in place for that already: you can't sell something unless you have it, and you have proof of that which you've given to the trading firm or the exchange before you ever got there, and the same for buying it, you had to have the money. Fraud comes in when you have a cluster of activity that does something that people consider bad. We have to cluster the trades: you get this infinite sea of trades coming through and you build some sort of cluster, and the cluster is generally things that are close in time, done by the same trader on the same instrument. Sometimes it's a couple of traders, that's rare, and sometimes what counts as close varies by the kind of instrument. Some instruments trade very rapidly, some trade very slowly, and a cluster of activity might be spread out over anything from seconds to days.
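
A minimal sketch of that kind of clustering rule: group by trader and instrument, and split whenever the gap between consecutive events gets too large. The 60-second gap and the field names are assumptions, since the talk says the right window varies a lot by instrument:

```python
from collections import defaultdict

def cluster_events(events, max_gap_seconds=60.0):
    """Group events by (trader, symbol); split on large gaps between events.

    events: iterable of dicts with 'trader', 'symbol', and 'ts' (epoch seconds).
    """
    by_key = defaultdict(list)
    for ev in events:
        by_key[(ev["trader"], ev["symbol"])].append(ev)

    clusters = []
    for evs in by_key.values():
        evs.sort(key=lambda e: e["ts"])
        current = [evs[0]]
        for prev, ev in zip(evs, evs[1:]):
            if ev["ts"] - prev["ts"] > max_gap_seconds:
                clusters.append(current)   # gap too big: close the cluster
                current = []
            current.append(ev)
        clusters.append(current)
    return clusters
```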

Very commonly a cluster of activity would span around a minute, but it did vary significantly. The unit of machine learning was the cluster; it wasn't one trade, one UDP packet to and from the exchange, but maybe 100 to 1000 such things. Because I had so little labeled data, I couldn't do classical supervised learning: some tiny fraction, not one-tenth of 1%, but one-millionth of 1% or something stupidly small, is labeled. I have some thousands of labeled clusters, and I get more labeled over time, but it's not very big, and I build a model on the labeled set; it's a supervised model.

Once I have that supervised model, I run it on the whole dataset. When the model declares that it has high confidence on another cluster that's not labeled, I give it that label. If it has a weak confidence level, I don't. Then you rebuild the model and you iterate a few times and it closes in. It's not perfect, but it's what you can do when you have such a tiny amount of training data to go after. This is not a case where you train on two thirds and test on one third and get an AUC out of that; no, I'm training on much, much less than a tenth of a percent of the data.
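
A compact sketch of that pseudo-labeling loop, using scikit-learn's random forest as a stand-in for the H2O model described in the talk; the confidence threshold and number of rounds are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.95, rounds=2):
    """Fit on the labeled clusters, adopt high-confidence pseudo-labels, repeat."""
    X, y = np.asarray(X_labeled), np.asarray(y_labeled)
    pool = np.asarray(X_unlabeled)
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    for _ in range(rounds):
        model.fit(X, y)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        conf = proba.max(axis=1)            # how sure the model is, per cluster
        keep = conf >= confidence           # only adopt confident pseudo-labels
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, model.classes_[proba[keep].argmax(axis=1)]])
        pool = pool[~keep]                  # the rest stays unlabeled
    return model
```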

The Process

Along the way, of course, there are bugs. In the cleaning steps in the ETL, you thought you knew and you didn't, you discovered something new. Some of it was just straight-up coding bugs, some of it was, "Oh, this is a weird thing. What does this mean?" You go back to the compliance guys, "Oh, yes, these guys do blah, blah," and suddenly you're all messed up and you have to go re-run your cleaning steps. The output of the cleaning and ETL phases would change the data you're training on.

You would have feature generation which kept getting changed; the data scientists are manipulating new features all the time, or changing how they train, or which models they're using. We tried quite a few models and we settled on random forest as a technique that is very hard to overfit. Because we have such a small amount of data, it's very easy to overfit, and RF basically cannot overfit. We realized that in the course of getting new data in, we're getting new labeled data, and so we want to be able to rebuild and revalidate the entire workflow process from scratch every time.
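
For concreteness, here is a minimal sketch of what training such a model might look like with H2O's distributed random forest from Python, assuming a CSV of per-cluster features with a binary "label" column; the file name, hyperparameters, and metric are illustrative, not the talk's actual configuration:

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# Hypothetical per-cluster feature table with a binary "label" column.
frame = h2o.import_file("clusters_with_features.csv")
frame["label"] = frame["label"].asfactor()

features = [c for c in frame.columns if c != "label"]
drf = H2ORandomForestEstimator(ntrees=200, max_depth=20, seed=42)
drf.train(x=features, y="label", training_frame=frame)

print(drf.auc(train=True))   # AUC on the training frame (binary label assumed)
```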

We built an ML pipeline, and we boiled it down to where we could type make and build the entire codebase and its different piece parts into a full pipeline in about five minutes. If we had to retrain, that was one to two hours, depending on how much data we were retraining on. The major goal here was that you could in fact change your algorithm, change your data, change your feature generation, change the expert scoring, change whatever came through, and rebuild and go again, because it's a continuously improving process. Getting that first cut through was painful, but after we closed the cycle and could rebuild the process every time, things just kept improving and you kept getting better and better results.

The end result was that most customers would happily take a whole new build every week. These are very conservative big businesses doing very high-value trading; that's unheard of in the industry. They're like, "Yes, this is cool. Give us another." The next one comes in and they're like, "Oh, ok." Then they go look at old results and new results and see what changed, and things get better and life goes on.

SCORE Architecture

Now, let me talk more about what we actually implemented as an architecture. There are some internal audit logs, and these are stored on write-only drives. That path, taking the data coming from the UDP packets and putting it into write-only archival storage, is required by law. You have to hang on to the data for seven years, and it's a well-greased, well-understood path for these trading firms. Somewhere in that pipeline we teed off and took some data out; we would read the logs and then put them into our SCORE server, which would go crunch and do the ML and produce results, which we would also save for all time, but also bring out for the browser to view. We had to save the results for all time because if a compliance officer said somebody was not committing fraud, and three years later a court case comes around and they decide this guy was committing fraud and the new models are declaring it fraud, why did the compliance officer not declare it fraud three years ago? Either he's in collusion with the trader, or he acted on the results that were available at that time. We have to be able to present what you saw three years ago, the same as the UDP packets coming through from the exchange, so the compliance officers can be cleared of wrongdoing.

We used H2O; the data sizes involved are easily within H2O cluster sizes. I hacked this slide and lost it in a save on the train, but I actually ended up doing a terabyte on a single server. Machines are bigger these days, and these trading firms are all used to putting in a custom machine for every application, custom tailored to whatever size, so a big fast machine is totally normal. The biggest people needed to put in a terabyte node, a 64-core, one-terabyte node, but most of the smaller guys would buy a pizza box that has 64 or 256 gigs.

As for bleeding-edge, state-of-the-art ML algorithms, we ended up doing a random forest because that worked and didn't overfit. The data scientists all worked in Python, they all loved their Python and they hacked Python, and, of course, H2O is written in Java and only runs in Java, so we have a Python-to-Java bridge we have to cross.

SCORE Internal Design

The design looks something like this: we're reading in gigabytes to a small terabyte count, we're pulling it into H2O, which puts it into a giant ass 2D table. We clean it, get all the crappy things out of it, and we sort it down one of three different paths, depending on which ML algorithm is running. We look for about a dozen different fraudulent behaviors: different algorithms for every piece of fraud, different ML models for every fraud. We broke them up by clustering differently according to the kind of models being run. We run the different models, those are the green things, the spoofing or abusive messaging or wash trades or cross trades, that kind of conduct. You get a risk score out, which is your 200 to 800 number, plus enough breadcrumbs that you could go back and find all of the actual transaction rows out of the billions and billions to validate that risk score. The filtering logic, not the rows, we didn't duplicate the rows, that would be too much, but the filtering logic was baked into the risk file as well.

We'd read an audit log, and we would decide which of several versions of ETL to use. There are about 50 cleaning steps done on all trading packets and 20 to 50 more that vary by the format of the audit log, which varies by vendor and by trading house. We ended up dropping or imputing missing values and doing a lot of normalization. A lot of the products would have different symbols according to which trading firm was trading them; this is the symbology problem, not everyone was using AAPL for Apple, for instance, it varied. Trader and account normalization had to happen as well: the same trader would show up under different names in some columns and under the same name in others, and you had to sort those out. A lot of things like buys or sells would be different tokens according to what the trading firm did: you might have the word buy, you might have B or 2 or K or heaven knows what. So, a few hundred individual cleanup steps in total; each individual file went through roughly 100 cleanup steps.
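
A small sketch of that kind of token normalization; the specific code mappings below are illustrative of the variation he mentions ("buy", "B", "2", "K"), not an authoritative table:

```python
# Illustrative side-code normalization; real numeric codes vary by protocol
# and vendor (FIX, for instance, uses 1 for buy and 2 for sell), so every
# audit-log format needs its own table.
SIDE_CODES = {
    "BUY": "BUY", "B": "BUY", "2": "BUY", "K": "BUY",
    "SELL": "SELL", "S": "SELL",
}

def normalize_side(raw: str) -> str:
    return SIDE_CODES.get(raw.strip().upper(), "UNKNOWN")   # surface, don't guess

def normalize_trader(raw: str) -> str:
    # Strip punctuation and case so trivially different spellings line up;
    # real account mapping needs a proper lookup table on top of this.
    return "".join(ch for ch in raw.upper() if ch.isalnum())

print(normalize_side(" b "), normalize_trader("Smith, J."))   # BUY SMITHJ
```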

Once it was cleaned up and you had a clean 2D table, we then had to break it up into clusters, and the clusters are never uniform. The clustering technology was parallel and scaled out nicely, but we broke it up according to, and you can't quite see it on the slide, but some of the lines are breaking on the instrument, the Apple and the NASDAQ ones, and some of them are based on the timing between things. Whatever the rule is for the kind of model we're running, you would break up the clusters. On a big data set you'd then have a few million clusters.

The clustering rules were written in Python, because that's what the data science team used; it was part of the model building process, basically part of the features. They wanted to run some Python for every row to walk through a billion transactions, and Python being what it is, we said, "Ok, let's go faster." We would run it in Jython, and because we're then running it in Java, we can go parallel, so we're running parallel Python; we auto-parallelized our Python, basically. The limitation, of course, is that you had to have a very simple kind of Python to do that.
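
The talk's approach ran restricted Python under Jython so it could execute on many JVM threads at once. A rough CPython analogue of the same idea, under the same restriction that the per-cluster function is pure (no globals, no shared state), might look like this; the function and field names are illustrative:

```python
from concurrent.futures import ProcessPoolExecutor

def summarize_cluster(cluster):
    """Pure function of one cluster: no globals, no shared state, safe to parallelize."""
    ts = [ev["ts"] for ev in cluster]
    return {"events": len(cluster), "span_seconds": max(ts) - min(ts)}

def summarize_all(clusters, workers=8):
    # Fan the pure per-cluster function out across worker processes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize_cluster, clusters, chunksize=256))
```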

Once you're done, you've got these clusters; a cluster would be 100 to 1000 rows, so you might get 1000 to 1,000,000 clusters depending, and that's the thing you're going to look for intent on. As you see on the bottom, the sizes vary: wash trades are two rows, there's two guys, and a wash trade is a wash, two guys just buying and selling stock back and forth. They're looking to give the illusion of volatility on this instrument, or liquidity on the instrument. Whereas abusive messaging might be 10,000 trades in a microsecond, where an exchange will cut you off, because that's what they would consider abusive; you can do 10,000 in a second, but not in a microsecond.

If anyone's doing front running, they see a trade come by, they think it's a big one, and they're going to front-run it: they're going to go to the other exchanges, puke out 10,000 trades in a microsecond to stall everyone else, while they execute trades against the thing that happened on this other exchange over here. This is classic Flash Boys kind of behavior. 10,000 trades in a second is ok, but 10,000 trades in a microsecond is abusive; it varies according to what you're looking at. A spoof attack might take a couple of minutes to run, so you might have trades running on for several minutes; it came and went accordingly.

Once you have these clusters, you now do feature generation on each cluster, again with Jython. This is the data scientists writing Python code: a CPU would grab a cluster and run the Python feature generation, basically building a state machine walking through the set of trades. You added, you sold, you canceled, you got a fill, you didn't get a fill, whatever is going on, and you build some features out of that. After that, you would apply the model and get a risk score out.
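
A sketch of what such per-cluster feature generation might look like: walk the time-ordered events as a small state machine, tracking resting orders, cancels, and fills, and emit summary features. The field names and features are illustrative, not Neurensic's actual feature set:

```python
def cluster_features(events):
    """Walk a cluster's time-ordered events and emit summary features."""
    events = sorted(events, key=lambda e: e["ts"])   # the market is sequential
    resting = {}            # order_id -> qty currently resting on the book
    position = 0            # signed net position built up from fills
    adds = cancels = fills = 0
    for ev in events:
        if ev["action"] == "ADD":
            resting[ev["order_id"]] = ev["qty"]
            adds += 1
        elif ev["action"] == "CANCEL":
            resting.pop(ev["order_id"], None)
            cancels += 1
        elif ev["action"] == "FILL":
            fills += 1
            position += ev["qty"] if ev["side"] == "BUY" else -ev["qty"]
    return {
        "adds": adds,
        "cancel_to_fill": cancels / fills if fills else float(cancels),
        "resting_qty_left": sum(resting.values()),
        "net_position": position,
        "span_seconds": events[-1]["ts"] - events[0]["ts"] if events else 0.0,
    }
```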

The market is inherently sequential, it's a giant ass, big state machine where this happened, then this, then this; things are strictly ordered at the exchange level. By the time the UDP packets make it back to the trading house, they've been reordered in the network. By the time they get recorded and logged, they've been reordered again, but there are timestamps, so you have to sort to get things ordered correctly. You ultimately get an ordering of events from which you can build a position, what the trader was doing. You can see the guy say, "I'm putting a trade in that's out of the market." Out of the money means it's either too high or too low: he wants to sell but he's asking too much money, he wants to buy but he's not offering enough. He's putting trades out of the market, and he builds and builds and builds, adding more and more trades, so he has a large position on one side or the other of the market, and then maybe he cancels.

As you add trades on one side of the market, people start assuming there's pressure to buy; they just haven't heard the news as to why you want to buy or why you want to sell, and the market is going to move in that direction. So the market drifts towards the side that you have that pressure on, and then you cancel, the market rebounds rapidly the other way, and you reap a fill on the other side of the market. Then, because the market went the other way, you start adding orders on that side of the market to get it to go that direction, and you induce a sine wave; that's the effect.

This is a spoofing attack, and I can talk more about that later if people want. Oystacher went to jail for a spoofing attack; he was making $100 a minute, so some tens of thousands a day, many millions a year. It's good money, and that was not the largest spoofing attack we caught, by two orders of magnitude. There are big dollars involved here.

Ultimately, the data scientists write Python, and they would be handed an array of rows, from the Python point of view, that they would walk through one after another, building their state machine, which they would then build the feature sets out of. We were limited to what you could automatically parallelize from Python. We had to say no global variables, so function-local variables only, and no native library calls, because they had to be thread-safe; there were a couple of sines and cosines and such, but usually no native library calls. Most generic Python is ok. These guys wrote a couple of pages of Python code for every model, hand-crafted carefully to build some fun features and then build a model on it.

Questions and Answers

Participant 1: In your analysis, do you include only features that are intrinsically related to the security you're analyzing, or do you also use external features? For example, you're analyzing anomalies happening in the trading of a security and you include, say, the S&P 500, because there are some correlations in it?

Click: Yes. Anything that was publicly available was included, you have the model sort out the winners and losers. The data science team is doing that, but they clearly threw in major index fund behaviors to get sort of a level set on the market.

Participant 2: On the modeling side, it sounded like you said you train it on a small set, you said one-tenth of a percent, and then whatever gets labeled as...

Click: Yes, a semi-supervised learning.

Participant 2: That sounds like a nightmare scenario for overfitting; it sounds like overfitting on steroids.

Click: Exactly. You can go look up the technique, and there are things you watch for to avoid overfitting. The issue is you just don't have any labeled data. Now, your other choices are anomaly detection and things you can do without labeled data, but that's not actually going to catch the kind of behavior we're looking at. As I said before, and you can look it up, you build a model on what you have labeled, you apply that model to more stuff, in my case a million times more stuff, and the model gives you a confidence as well as a score, and when the confidence level is high enough, you say, "Ok, good enough, we'll take it."

Where it wasn't confident, you said, "Ok, this model doesn't understand this kind of data." Then you reiterate just a couple of times, because, as you said, you get this nasty overfitting problem, but one or two times through isn't so bad. When we were done building that model, we went from a 95% false positive rate to a 5% false positive rate. It was a huge improvement over what happened before, but still, a human has to look at the results.

The good news is that we could produce results within about 15 to 20 minutes of walking in the front door of a firm: ten minutes to install a JAR file and hook up to wherever their tee and their data set was, ten minutes to run and produce a result on their screen. More than once, several times, I had compliance officers look at something going on and say, "Oh, that's interesting. Trader X is doing whatever," and they'd drill in and look again, drill in and, "Holy fuck," and drill in and, "Holy shit," and they'd check three or four more, and they'd run out the door to go talk to the CEO, and the next day trader X is on the street. As long as you're diligent, you're not in trouble, but if you're not diligent, you can lose your company. It was quick and it was good enough, by a long shot good enough.

Participant 3: These problems seem to me to be adversarial problems.

Click: Yes. I have moving adversaries who are constantly trying to game the system.

Participant 4: How did you guys design the system, keeping in mind this is an adversarial setting? In terms of designing the model, in terms of the whole pipeline, getting the feedback back, what are the challenges and how did you guys go after those designs?

Click: Well, the major challenge was the lack of labeled data and the fact, of course, that the data is moving: you get new trades every day, you get new compliance officer scores every day, and then there are new behaviors that people will declare as fraudulent, as intent to deceive, not every day but it happens, or new ways of trying to achieve it. It was kind of a case of having to wait for a compliance officer to vote that this behavior was bad. From my point of view as a data scientist, as an engineer, I got some results from an oracle that I then applied to this very vast data set, and I filtered out, "Hey, these look interesting." He would pick through them on a daily basis and come back. I had a couple of compliance guys, plus every firm we went to would pick through and give those picks as well. Sometimes they voted on something that was funny that we didn't have a model working well on, and we would debate whether it was a new kind of fraud and whether we needed to tweak an old model or make a new one, and sometimes we made a new one.

It was a continuous arms race. We had an order-of-magnitude leap in visibility and in the quality of the understanding of what you're looking at and what to look at; that's the game changer in that sense, it was a big jump forward. After that, we were in an arms race; the money is so big, so much, that people are always motivated to play the game.

Participant 5: You mentioned that you have as many as 1 trillion rows of data in one day, and that some forms of fraud take place over multiple days. I'm curious about the challenges that you've had working with such a large amount of data and classifying between several days' worth of that data.

Click: A trillion across the entire market, but a big firm would have a billion, a top-10 bank. In that case, I needed a terabyte I could inhale into an in-memory cluster, but a terabyte of memory is not that hard to get. With H2O, I actually did these on one fat node for the big bank. It was a fat node, but it wasn't that fat; we didn't have to do a trillion rows, we did a billion rows, we did a terabyte. You do a couple of days' worth, after you've boiled it down in the ETL cleaning phase, and not on all instruments. The rapidly moving instruments were not having bad behavior that spanned multiple days; that generally ran in seconds to minutes.

Whereas some high-dollar-value instruments that only traded a couple of times a day would have the same behaviors time-skewed out over multiple days: the count of trades was low, but the actual dollar value per trade would be in the many millions of dollars for every single instance. There was a different ratio, you just had to skew the ratio one way or the other. The dollar value in both cases was the same order of magnitude, but the actual count of trades was low when it was spread out over many days.

Participant 6: The question I have is, have you been involved in any of the debt markets, or the securities and bonds side of the world?

Click: We had plans to go there and didn't; we were all over the commodities market and were heading toward the stock exchange market when we sold the technology off.

Participant 7: What is your requirement, from the time of the fraudulent trade, to discover it with your system?

Click: What is the legal requirement?

Participant 7: Another question, is it easy to capture the people going across exchanges with different accounts?

Click: There are two different things. We're a toolmaker, we're not compliance, we don't do the trades, so we're not involved legally in any way. All we have to do is supply you the tool. If we tell you, "Hey, go check this out," we do, but there's no legal requirement.

Participant 7: How many minutes between the action of fraud trading versus when you ... ?

Click: There's no legal requirement to detect it fast. The initial setup, where most of the results I gave here came from, was called T-plus-one or end-of-day: within 30 minutes of the end of the day we would have results. People stopped trading, the exchange closed, and before they went home for the day, the compliance officer had the results for the day. That was the goal, and that's where we got to. By the time we sold it off, we were also hot and heavy into doing it on the fly, and in that case, because of the clustering activity for certain things, we would do what we call mini-batching and just have a continuous run: you'd have an hour's worth of data in memory, you would add 10 seconds at the front and drop 10 seconds off the back, then you'd rerun your models and clustering across that section, and do that on a continuous rolling basis, and results would pop out after. The different clusterings ran for different lengths of time, but you had to have a whole cluster available in RAM, so you might need a couple of minutes to tens of minutes of data for a slow-moving instrument, less for a faster one.
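
A sketch of the rolling mini-batch he describes: keep roughly an hour of events in memory, slide forward ten seconds at a time, and re-run clustering and scoring over the current window. The window and step sizes follow the numbers in the talk; rescore_window() is a hypothetical stand-in for re-running the real clustering and models:

```python
from collections import deque

WINDOW_SECONDS = 3600   # keep roughly an hour of events in memory
STEP_SECONDS = 10       # slide forward ten seconds at a time

def stream_scores(event_stream, rescore_window):
    """Yield fresh results each time the window advances by one step."""
    window = deque()
    next_cut = None
    for ev in event_stream:                  # events assumed roughly time-ordered
        window.append(ev)
        if next_cut is None:
            next_cut = ev["ts"] + STEP_SECONDS
        if ev["ts"] >= next_cut:
            # Drop events that have fallen out of the hour-long window.
            while window and window[0]["ts"] < ev["ts"] - WINDOW_SECONDS:
                window.popleft()
            yield rescore_window(list(window))   # rerun clustering + models
            next_cut = ev["ts"] + STEP_SECONDS
```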

Participant 7: My last question. What's the big takeaway? What do you want people to leave with?

Click: It's doable, but it's important to be able to acknowledge that you are not perfect, you will write bugs. We're a data science team: you'll have bugs in the data science, you'll have bugs in the quality of the input data, you'll have bugs in how you engineer things, you'll have bugs everywhere. You have to be able to respin the whole process, all of it, from the original data through the pipelines to the end display, every time somebody makes a change. You can't have siloed pieces, I did this and you did that; you have to integrate them all, because garbage in, garbage out: if the data science team gets garbage in, they produce a garbage model. Similarly, if they have bugs, they'll put out garbage that goes into the displays afterwards. You need that end-to-end setup where you can just hit the button and rebuild everything from scratch.

 


 

Recorded at:

Jul 06, 2019
