Panel: First Steps with Machine Learning


Summary

The panelists discuss the first principles to follow when adding ML to a system.

Bio

Nischal Harohalli Padmanabha is currently the VP of Engineering and Data Science at the Berlin-based AI startup omni:us. Shengyu Fu leads an applied science team in Microsoft's Cloud & AI division. Soups Ranjan heads Financial Crime Risk at Revolut. Cliff Click was the CTO of Neurensic, and CTO and co-founder of h2o.ai (formerly 0xdata).

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

Transcript

Moderator: This panel is a very diverse group, and I'm actually going to let them introduce themselves rather than me trying to butcher any names. This is all about answering my need, literally: my first steps. What should I be focused on as a software engineer wanting to get into ML, start using ML more, and convince leadership on things that I want to do? For example, I work for an edge company deploying use cases at the edge, so I want to be able to use machine learning to anomaly-detect things at the edge. I want to be able to reduce the amount of data that's coming back to origins, things like that, so this is a self-serving panel for me.

We'll start it off with Soups [Ranjan] and let him introduce himself, and then the first question you can answer as part of introducing yourself is, how did you get into ML? What was your first step into ML?

Ranjan: My name is Soups [Ranjan], I currently head financial crime risk for a U.K.-based challenger bank called Revolut. I'm based out of San Francisco, though; I've lived in the valley for about 14 years and have worked at a variety of companies. Immediately prior to this, I was with Coinbase, heading data science over there. How did I get into machine learning? I got into machine learning before the term data scientist was coined. After graduate school, I got the opportunity to work for a company which had access to very rich data sets from telco companies, because we were building cybersecurity products which we were selling to telcos. I did not learn machine learning in school because it was not even taught at that time. I just came across "The Elements of Statistical Learning" book by Hastie and Tibshirani, read it cover to cover, started applying it to the data set I had access to, and that was it.

Click: I'm Cliff Click, I've been coding for 45-plus years; I don't know how far back in history you want to go here. I'm probably most well-known for doing the core guts of the Java Virtual Machine; I did the JIT compiler that changed everyone's head about how you write programming languages. About a decade ago, I co-founded a startup doing big data and a really super high-quality, high-speed exact key-value store, for which the market was completely saturated, so we pivoted to machine learning.

That was the first time I ever looked at it, and I started from scratch. I hadn't done math beyond algebra since high school, so I learned all about AUC curves and logistic regression versus deep learning versus whatever, and implemented a lot of these algorithms in high-speed, high-scale technologies in h2o. Since then, I have applied that knowledge in a couple of startups, including the one I talked about this morning, which is looking for fraud in the stock market. I've done both the use of these tools and the implementation of tools for that use.

Fu: Hi, my name is Shengyu [Fu], right now I'm a data science lead in the Microsoft developer platform team. Our team's mission is to infuse AI and machine learning into dev tools. Before that, I was doing advertising for almost five or six years. That's where I started getting involved with machine learning; we started with simple models like logistic regression and went all the way to deep learning models with very large-scale data. Before that, I worked on a lot of logistics and CIM-type software when I started my career after graduate school.

Padmanabha: I'm Nischal [Padmanabha], my journey in data science was funny; I didn't know I was doing data science. I got into a team at SAP, where I had just joined as an algorithms engineer, and we were working on the stock market then. Initially, it was just writing a few time-series-based algorithms in R and doing simple stuff like simple moving averages, and eventually that led to a lot of complex things. During this journey, some people on my team eventually told me, "Ok, this is what data science is," and I was like, "Oh, cool. Ok." I've continued working in that for some time now. Currently, I'm the VP of engineering and data science at omni:us, and primarily we're doing a lot of work on computer vision and natural language processing with deep learning.

Molino: I'm Piero [Molino], I work at Uber in the AI Labs team. My history is a little bit weird; I actually enrolled in computer science to build video games, and then after I took my first course on AI, an introductory course, I said, "Well, maybe I should pivot." I started working on that, and I haven't stopped ever since. I founded a couple of companies, and I worked for a couple of startups and also for a couple of big companies; the last one that I worked for, Geometric Intelligence, was acquired by Uber, and so the core of that company became Uber Dialogues, and that's where I work now. In particular, I work on conversational AI.

Recommended First Use Cases for Machine Learning

Moderator: This question is super general and it's going to change based on every possible application you can have, but what are some really good first use cases that you would recommend for people? What is a really good use case for machine learning where you can get quick wins to be able to show its power?

Click: Pick something that's obvious and has direct applicability to your business case. Looking for something subtle is hard; you want something obvious so that you can tell you're finding it, because you pretty much know you can find it already. You know what the hell you're looking for, and then you want to be able to turn around and say, "Ok, because of this, now do that." Maybe that's too simple, and you already have a solution for it because it's too simple, but it's not, because it gets you through that process of thinking about the problem, building the model, getting a working application, and getting that cycle going where you can then expand on what you're doing.

Moderator: Give me some examples. What are some characteristics to look for?

Click: I talked this morning about looking for fraud in the stock market. The easiest fraud to look for is a wash trade, which is the same guy swapping a stock back and forth to himself. I'm looking for an event and a counter event, generally a few seconds later, same trader, same stock; the rules are real simple, everyone can spot it, it's very obvious. Immediately, bam, fraud.

Participant 1: [inaudible 00:07:19]

Click: It's done with a rules engine, but you start with building it: build the model, get the data pipeline going, get your repeatability going so that you can do something more complicated. There's so much legwork that it's hard to get started, so do something where the legwork still has to be done, but the goal or the thing you're trying to look for is blatantly obvious. Don't add too many problems all at once. Make the machine learning problem the simplest you can get it, solve all the other pieces that feed into it, and then start getting to more complicated ML, but you have to have all those pieces before you can even begin.
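As an illustration, a minimal sketch of the wash-trade rule described here, assuming hypothetical trade records with trader_id, symbol, side, and timestamp fields and an arbitrary five-second window:

```python
from datetime import timedelta

# Sketch of the wash-trade rule: flag a pair of trades where the same trader
# buys and sells the same stock within a few seconds of each other.
# Field names (trader_id, symbol, side, timestamp) are hypothetical.
def find_wash_trades(trades, window_seconds=5):
    flagged = []
    by_key = {}  # index trades by (trader, symbol) so only plausible pairs are compared
    for t in sorted(trades, key=lambda t: t["timestamp"]):
        key = (t["trader_id"], t["symbol"])
        for prev in by_key.get(key, []):
            close_in_time = t["timestamp"] - prev["timestamp"] <= timedelta(seconds=window_seconds)
            opposite_side = prev["side"] != t["side"]
            if close_in_time and opposite_side:
                flagged.append((prev, t))
        by_key.setdefault(key, []).append(t)
    return flagged
```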

Fu: I would say there are two categories of problems, two big directions I would look at. One is from your customers' perspective: see what compelling problems they have. For example, when we were working on the code completion problem in Visual Studio, millions of developers are using it to code, so we want to improve their productivity; what is the most frequently occurring thing in this process that we can optimize? Then the second question is whether you have enough data to apply machine learning to help that process. I would start from those two angles: the customer needs, and whether you have data. If those two are satisfied, then you construct the right team to tackle the problem.

Padmanabha: From my point of view, the biggest challenge with machine learning is identifying and formulating what the problem is. Usually, a lot of us have a problem saying, "Ok, what is it that I really want to solve?" If you want to understand machine learning and want a first project that's very close, I would even suggest looking at Kaggle competitions, where the problem is defined, the data is there, and there's a very big chance that a lot of people are already attempting it, so you have benchmarks. It's a good place to start because you can put your focus and attention on understanding the data, trying out a few different machine learning models, understanding different techniques, and how it all actually evolves. That's a very good place, and I've seen that it helps a lot of people, as things get very ambiguous when you start Googling for where to start with machine learning.

Molino: I want to just add a little piece to the data thing. At least what worked for me, I can tell you, is to look for things that people are already doing, because if a bunch of humans are already doing something, they may be better off either with support for it or with having something take over the parts that are boring to them or that they don't want to do. Those are, in my personal opinion, the best cases.

The customer support use case, which is one of the ones that I worked on, is perfect from that point of view, because there are many options for the customer support representatives, so it's a time-consuming job and they're doing it already. Because they're doing it already, you have a lot of historical data, and it's something that for them is really boring, because they have these thousands of options and they have to scroll through them and read all of them. If you can, for instance, surface a certain number of options to them, you make the humans much more effective, and if you're really good at doing that, then you could actually remove at least a small percentage of the things that they are doing from their domain of expertise, actually remove them entirely. These are the kinds of products that I believe are good to start with, where there's a lot of data because humans are already doing it and they are bored doing it.

When a Rules Engine May Not Be Enough

Moderator: I want to come back to this rules engine question. Soups [Ranjan], last year, you did a talk, a 10-minute talk about when to go beyond a rules engine to go to a CNN. Can you talk a bit about some of the signals when a rules engine may not be enough?

Ranjan: I think the time when a rules engine stops being effective is when you realize your software engineers are very afraid to add a new rule because they don't really understand how the whole system would behave if they added it. Or when you see that they're very afraid of removing a rule because, if they remove it, they have no idea how the whole system is going to behave. It is at that time you realize that you've added too much technical debt, and that's a prime time to actually move to a machine learning-based system.

Fu: I think in a way, you can think of machine learning as automatically learning the rules, instead of manually creating them. A machine learning model, a deep learning model, is essentially not rule-based, but it's generating something like mathematical rules, automatically learned from the data instead of manually created. In advertising, when I joined Microsoft in 2008, we actually had a bunch of segment analysts creating the behavioral segmentation to sell to advertisers. Later on, it was very difficult to manage those rules, and there was no good correlation between the rules and the actual performance of the segments. Now, you build the correlation between the user features and the performance characteristics for those segments, and then you go from there and can generate the different machine learning models from that.

Click: I have a simpler starting point, which is that there are a lot of tools in your toolbox. ML is a new one, but it's not necessarily the right tool for the job, so make that call first. So many times people will talk to me and say, "Hey, ML is going to save us. How do we do ML here?" The answer is, "Well, a little basic math will just solve your problem directly, and machine learning is just not appropriate." That's not necessarily what people want to hear, but it's more often true than not.

Machine Learning is Not Necessarily the Right Tool

Moderator: Expand on that a little bit. What are you talking about? Linear regression? Are you talking about something specific? Are you just talking about statistics? What do you mean when you say maybe machine learning is not necessarily the right tool?

Click: I get approached by a lot of people who say, "Hey, we're going to do ML to do something here." For instance, I'm currently doing an IoT startup. Ok, great. All the clients, the customers, say, "Hey, run ML on the sensor data." Ok, someone's looking for drops and falls on their transport ship, and they've got a little widget that's got an accelerometer like your phone has.

A six-foot drop is about a second of free-fall and then a shock. There's no need for machine learning here. This is a very simple pattern that can be detected right away, and I can tell you the length of the fall; it just doesn't make any sense, so first, try the obvious thing before you try an ML solution, and if it's going to work, you're done. That's what I'm saying here. There are certainly times when ML makes sense, and when you can't touch the rules engine because it's too complicated, that's a perfect example. But there are also cases where people push for it because it's the new hotness.
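As a sketch of what "no ML needed" can look like here, the drop pattern could be checked with a few lines of code; the sample rate and thresholds below are illustrative assumptions, not values from the talk:

```python
# Drop detector sketch: a stretch of near-zero acceleration (free-fall)
# followed immediately by a large spike (the shock).
def detect_drop(samples_g, sample_rate_hz=100,
                freefall_g=0.3, shock_g=3.0, min_freefall_s=0.2):
    min_freefall_samples = int(min_freefall_s * sample_rate_hz)
    run = 0  # length of the current free-fall run, in samples
    for g in samples_g:
        if g < freefall_g:
            run += 1
        else:
            if run >= min_freefall_samples and g > shock_g:
                fall_seconds = run / sample_rate_hz
                # Simple physics gives the height of the fall: h = 0.5 * 9.81 * t^2
                height_m = 0.5 * 9.81 * fall_seconds ** 2
                return {"fall_seconds": fall_seconds, "height_m": height_m}
            run = 0
    return None
```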

Moderator: Because it's cool. Right.

Padmanabha: Just adding to that, a lot of talks yesterday and today were talking about pairwise distance and identifying things, and you see this a lot in natural language processing as well. If you're building out a website and someone types a search query, the first thing that you might think is, "Should I write a deep learning network to power my search engine?" Actually, you don't need to. There are still technologies like Lucene and Solr that do a fantastic job with traditional algorithms that have been there and have been tested for so many years and work quite well. I really agree with what they're saying: ML is not really the answer to everything. It's just one of the tools, and you just have to know how to use it. At a certain point in time, you will realize if you have written too many rules, because things simply won't work after a point.

Wrong Approaches When Starting out

Participant 2: You've been talking a little bit about the first steps, but when you go to Google, as you mentioned, there are 300 million algorithms, 300 million whatever tools out there to use. It does get confusing and so on, but what would you say is the absolutely wrong thing, the wrong direction to go in, when you're just starting? Do you go to tutorials? Do you go to TensorFlow versus PyTorch versus something else? What is the worst thing you could do in this regard?

Click: Not having an end goal in mind. Have an end goal in mind: what is your goal? If your goal is to dabble in data science and become better at it, then maybe Kaggle is a great place to start. If your goal is to fix some interesting business problem, like how long I can run these buses on the city streets before they must go in for guaranteed maintenance, I have a different end goal in mind, and now maybe I need a different starting point and a different direction ahead. I'm going to look for the ability to productionize whatever I end up doing, and the actual ML might be simple and straightforward, but I need to be able to go end to end and tell the maintenance crew when to pull a bus off the street. Have an end goal in mind; in many things in life, if you don't have an end goal in mind, you'll meander and flounder about.

Participant 2: Should you start off with TensorFlow or Scikit-learn?

Ranjan: You could start with Scikit-learn, I would say, but actually, even before you start with Scikit-learn, I just want to add to what everyone here is implying, which is that you have to first measure what you want to improve. If you can't even measure it, then there's no point in using machine learning. You need to know, in the case of fraud, what is my fraud rate? Is there a problem there? Is the fraud rate too high, and therefore, how can I bring it down? Until you define the metric you want to move, I would never start with machine learning. I think that's the main fallacy I see, which is that people treat machine learning as a black box where magic happens when you put data into it and train a model.

I think the machine learning part of it is now getting easier. You can treat machine learning model building as a black box, but before that, you have to do a lot of legwork when it comes to defining the problem, formulating it precisely, and figuring out what business metric you want to move, and then there's the latter half after the black box, which is, how do you actually productionize it and measure it going forward?

Fu: I echo the metrics point strongly: if you don't have metrics, you cannot do anything. Besides that, for any problem you're looking at, you always start by exploring the data first; do a lot of statistical analysis before you actually apply machine learning methods, so you get a lot of intuition from the data and you can find some of the patterns just by looking at the data. Then you'll know which machine learning methods are going to be enough for this problem after looking at the data. Maybe you just start with a simple method before you actually look at deep learning methods. Also, you need to consider a lot of the constraints in your particular scenario. If it's on the edge, then you have a lot of computing resource constraints. All of those have to be considered before you choose a particular method.
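A minimal sketch of that "explore the data first" step, assuming a hypothetical CSV file and label column, might look like this in pandas:

```python
import pandas as pd

# Look at the data before reaching for a model; file and column names are hypothetical.
df = pd.read_csv("events.csv")

print(df.shape)                                        # how much data is there?
print(df.dtypes)                                       # what kinds of features are these?
print(df.isna().mean())                                # fraction missing per column
print(df["label"].value_counts(normalize=True))        # class balance / base rate
print(df.describe())                                   # ranges and obvious outliers
print(df.corr(numeric_only=True)["label"].sort_values())  # rough check for signal
```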

Molino: I have a really personal take on this, on what not to do. My personal opinion is that you shouldn't go straight to the code. You should first learn the theory and have a top-down kind of approach to understanding things, and then when you feel like you're ready to go bottom-up, pick a task that's super important to you personally; if you like wine, the task could be wine recommendation or something like that, because that is what is going to motivate you through all the learning process that will bring you to the point where you actually have something that works, and keeping up that motivation is really important. Picking something that matters to you really is important. Then, after you build this understanding both top-down and bottom-up, you can go and apply what you learned to other things. That's what's interesting.

Moderator: When you say top-down, can you elaborate? What do you mean by top-down?

Molino: Learning the theory. For instance, if you start with Scikit-learn, or if you even start with Ludwig, what happens is that you don't really see what's going on. You pick an algorithm, almost at random at the beginning, and you get a result out, and then you have this black box, but you are not going to make any progress in learning the field if that's all you're doing. You should figure out what's going on in the black box first.

Moderator: One thing I hear a lot when people talk about machine learning is starting with just intuition, just what makes sense. Does that run counter to that top-down thinking, or is it complementary?

Molino: I think it's complementary, because the intuition is what drives the understanding for the top-down approach too. When someone is going to explain to you how an algorithm works, if you don't have any intuition, you're going to read the same paragraph 10 times before you get it. The intuition is what is going to help you nail down the algorithm, but I still believe that the top-down approach, at least at the beginning, is the best thing to do for building the foundation.

Strategies for Iterating

Participant 3: In my experience with machine learning, and also in a lot of the talks - Uber, for example - they make predictions for which they can very quickly figure out whether the predictions are good or not. Have you worked in a domain where there's a significant lag between making a prediction and figuring out whether it's correct or not? What kind of strategies work there? We're working in digital agriculture, so if we make a prediction as to yield, we may have to wait months to find out. What are good strategies for iterating and using that time productively?

Molino: I can relate to this, because there can be situations where you never really get the feedback that you want, for instance, with a recommender system. You will never know, if you had provided different recommendations, whether the users would have done something different. You can run A/B tests for sure, but at the same time, you'll never have the counterfactual in any case. Those things also have to be considered, and that introduces bias in the way you're evaluating the system and bias in what the next iteration of the system is going to be evaluated on, because you're evaluating on the outputs of the previous system. All these problems are something to consider. I don't have a silver bullet solution; maybe someone else wants to add more on that.

Padmanabha: Sometimes the predictions that you make take quite a long time to validate as true or false, and in some situations you can't at all, but if you're doing agriculture and you say there's going to be a certain yield in six months, I think, by definition, you need to keep monitoring the state of the system to see whether what you've predicted holds good for the time period over which you've said the yield will happen. This is also where an understanding of how confident you were in the prediction itself, how big and diverse your data was, the statistical significance of your prediction, and how you built the model and what it's seen, all of that is very important, not just the prediction itself.

Fu: I'll quickly add one point: if you don't have real-time or short-term feedback from the online system, you should at least have a very good offline evaluation method so that you can iterate on your model by testing with historical data.

Click: I'm going to say the same thing but in a different way. You just got a new supervised data point; it just took you a couple of months to get one. Keep collecting them, don't ever lose them, but it just takes you a while to get supervised data. Me, I never got more than 1/1,000,000th of my data set labeled. I could get any point labeled I wanted, but I could only get 1/1,000,000th of them labeled in total. The rest were never ever labeled; that's just what it was. If it takes you three months to go around and you're trying something new, you don't know. Once you know, add it to all the points you had before and use that going forward.

Ranjan: The only other thing I would add is, I'm pretty sure you could figure out ways to measure intermediate steps and create some sort of other supervised model which tells you, at these intermediate steps after you sowed the seeds, what were all the variables that you noticed, and did they lead to a good yield eventually or not? I'm sure you could solve this problem by just breaking it into steps.

Failed Machine Learning Projects

Participant 4: I have a quick question for the panel about your experience applying machine learning to build machine learning products where it turned out that the project failed, it didn't work out the way you expected it to. Just curious if you can share some of those experiences.

Ranjan: In my experience, the primary reason projects fail or don't lead to the success that you thought they would is that the problem itself was not well-defined: people didn't understand the business metric they wanted to move and, therefore, how to formulate it as a machine learning problem. That, in my mind, is basically the Holy Grail. You always have to understand and break it down to what business problem you want to solve.

Fu: In our team, we spent almost one year on a particular deep learning project, trying to learn from past code reviews to build a bot that could do automatic code review. It was a deep learning model, so when we tried to evaluate it with our own engineers through dogfooding, it turned out a lot of the recommendations didn't make sense. The reason I think deep learning didn't work here is that it's very difficult to explain why the model made a recommendation. It's very hard to debug, and interpretability apparently is very important here, which is very difficult to achieve. That's why we're going to a simpler approach and looking at using the machine to generate rules, which makes it much easier to explain; we are taking a simpler approach, basically.

Padmanabha: It's good for machine learning projects to fail. It's ok, because projects definitely do fail: when you start your machine learning journey, the first 5 to 10 attempts are failures, because everything works well when you're testing it out on your test data set; you take it into production, and on the production data it's never the same. But it's good, because you then really start to spend time actually understanding what's going on, why it is failing, and what you can possibly change. You will go through the progression of understanding why sometimes simple rules help, simple algorithms help. You don't really need a gradient boosted tree to do a simple classification between a cat and a dog; you could do that in so many different ways. I think sometimes overcomplicating things and trying to use the latest and greatest also impacts delivery, because you want to be cool and you end up using something that's the latest, but you don't know why it works or why it does not work. As a machine learning engineer, I think it's simply ok to understand that out of the 10 experiments you run, 9 are bound to fail and 1 will work, and that's how you learn.

Molino: Just relating to that, it's also a matter of the process: it's an iterative process. You try something, it fails, you learn from that and you improve it; or maybe it doesn't fail, it actually works, but it doesn't work up to the standards that you were expecting, and so you keep on doing it all over again. Another thing I can add is a personal experience with a failure that was actually a good failure, in the sense that good failures are the failures where you figure out immediately that this thing is not going to be as accurate as you were expecting.

In this specific case, my personal experience was a model that had to balance false positives and false negatives, where false positives were much more expensive than false negatives. Even counterbalancing for that factor, we figured out that, because the false positives would have needed to be manually reviewed by expert users and the false negatives were not that important, in the end even the best model we could come up with was not economically viable. But we figured it out immediately, because we had a nice economic model of what it would mean to have that model in production in economic terms. If you can figure out immediately what's going to work and what's not, and what's economic and what is not, that's a good thing, because you can kill the project immediately and not spend time and effort and money on it.

Interpreting Machine Learning Systems

Participant 5: My question is around interpretation; I heard you mention that interpretation is hard. I work in clinical decision support in healthcare, but it's the same for any domain: if you have a machine learning model that gives a wrong answer, having to explain why to a subject matter expert in a specific domain, in healthcare especially, can be very difficult. I'm sure other domains have similar complications. How do you deal with that when it's very much a black box?

Click: Something similar here is the fraud domain: if I declare fraud, somebody goes to jail in theory, but of course they don't without a review by a federal judge. Now, explain your ML model to a federal judge, somebody who's definitely not an expert in this field and doesn't really care about the math. He's not going to learn it, he's not going to care. I treat it as power steering: the goal is to point you in the right direction, but have all the facts that were used to reach the decision transparently available. Don't explain the decision itself, but bring to bear the facts that you used, and then present them in a way that the person looking can understand why maybe there was something, or why you were here at all, and then they can go further and bring their expertise to bear. I empower the experts to make the actual call; all I did was find the needle in the haystack.

Ranjan: It's pretty important when you have a human in the loop for any machine learning system. For the humans to actually trust the machine learning system, it has to be explainable, which is why I think the majority of successful machine learning use cases still use logistic regression: by the nature of the algorithm itself, all the features are very intuitively explainable. You can immediately explain that this feature had the highest contribution towards the prediction.

The second thing I have noticed is that interpretability of machine learning models is an up-and-coming area. In fact, at one of the QCons last year, we had a talk by someone from Cloudera, I think Mike Lee Williams. He has done some amazing work on LIME, which is basically an approach that you can use to interpret even black-box models. The general theme now is, "Let's use neural networks to come up with the predictions, but at the same time, we'll also run other models like LIME, and there's another one called SHAP, and we'll use these to actually explain the output of that model."
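As a rough sketch of why logistic regression is so easy to explain, and how a library like SHAP is typically invoked on a black-box model (the data below is synthetic, and the SHAP snippet is an assumption about that library's usual API, not something from the talk):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data stands in for a real fraud/claims dataset.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# For logistic regression, the contribution of each feature to the log-odds
# of a single prediction is simply coefficient * feature value.
model = LogisticRegression().fit(X, y)
x = X[0]
contributions = model.coef_[0] * x
for name, c in sorted(zip(feature_names, contributions), key=lambda p: -abs(p[1])):
    print(f"{name}: {c:+.3f}")

# For genuinely black-box models, libraries like LIME and SHAP (mentioned above)
# attribute a prediction to features in a similar spirit, e.g. roughly:
#   import shap
#   shap_values = shap.TreeExplainer(tree_model).shap_values(X)
```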

Time Frame for Basic MVP & the Simplest Model to Use

Participant 6: What's ideally the time frame you would suggest for a very basic MVP, considering that it's a first project or first venture into ML? The second thing is, what could be the simplest tool? Nischal was in fact saying, "Don't go for the latest, greatest tool. You don't know whether it's working or not working, why it's working, why it's not working." What could be the simplest of the tools or the simplest of the frameworks to start with?

Click: The MVP depends on what you're going to deploy to, and whether you need to deploy at all. Assuming you need to do some kind of deployment, you're probably in it for several months to half a year, because of the deployment issues and nothing else. For the simplest model, I would literally put in a black box that takes data in and throws a random number out. Build the deployment, get all that functional, then refine what's in the box. Put a rule in, then a GLM, then a GBM, then deep learning. Start by getting that production pipeline going, because that's how you're going to continuously iterate, refine, and build more features and better results, and know that you're actually improving.
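A minimal sketch of that "deploy a random black box first, swap in better models later" idea, with made-up class and field names, might look like:

```python
import random

# Fix the prediction interface, ship the pipeline end to end with a random stub,
# then swap in better models (a rule, a GLM, a GBM, ...) without touching
# the rest of the system.
class RandomBaseline:
    def predict(self, features: dict) -> float:
        return random.random()  # placeholder score

class SimpleRule:
    def predict(self, features: dict) -> float:
        # First "real" model: a single hand-written rule (threshold is made up).
        return 1.0 if features.get("amount", 0) > 10_000 else 0.0

def serve(model, features: dict) -> float:
    # The serving/deployment code never changes when the model does.
    return model.predict(features)

print(serve(RandomBaseline(), {"amount": 250}))
print(serve(SimpleRule(), {"amount": 25_000}))
```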

Participant 6: Python, Scikit-learn, containers deployed with Kubernetes, MVP? Something like that?

Click: It's fine. It's not my personal choice, but I'm not a Python fan for this.

Participant 6: It's something like this?

Click: Something like that, yes.

Participant 6: Specifically, what technology stacks? What's a good MVP stack for people to get an idea of what they have at hand and what they could do?

Click: I was deploying to people's private servers, so there was no cloud; I needed a portable solution to walk in with, so it was a Java JAR, and I know h2o, so I was using h2o for machine learning and I walked in with the Java JAR. Not long after that, I hired a data scientist who loved Python and Scikit, and it was then Python and Java walking in, but we walked in with a JAR as part of the deployment process.

Fu: I would say start with Scikit-learn if you can run on a single box. If you need distributed, start with Spark and MLlib.

Padmanabha: As for the technology itself, I think Python is the best bet right now, especially for machine learning, not because it's the only thing that's there, but because of the community around it. Every time you run into a problem, there is a 95% or even close to 100% chance that it's already been answered on Stack Overflow. That is quite powerful, because the way you choose technology is largely based on the kind of community adoption there is. Initially, there are going to be a lot of hiccups and you will need constant 24/7 support from Stack Overflow to get through them, so I would definitely suggest Python. If you can get it to deploy through a JAR, that's also a very good deployment option, or I would say Docker, because Docker right now is well-maintained and managed, the community is good, and it bundles all the dependencies that you need to run very easily.

Data Scientists, Engineers and Structuring Teams

Participant 7: This is an easy one. In order to run a successful ML project or product, you need a data scientist as well as an ML engineer or a data engineer. How do you make these two very different beasts work efficiently together? And if you could only hire one, which one would you hire first: data scientist or data engineer?

Click: Yes. You hire a dude who does both. You have to have somebody with enough data science skills, enough chops, to go make a model that's reasonable and not get stupid with the data. You have to have somebody who can do something in a production setting. Sometimes you can find them in the same head, and that's what early startups typically end up doing: one person wearing all the hats.

Ranjan: Actually, I would hire neither a data scientist nor a data engineer first. I would hire a machine learning engineer, which is basically a software engineer who knows the basics of machine learning and can build the whole thing end-to-end, because a data scientist, in my mind, is someone who's excellent at machine learning but is going to live in the world of prototyping and building super sophisticated, complex models, and a data engineer is just going to build you the pipelines. First, start with a machine learning engineer; the next hire is a data engineer; the third, a data scientist.

Participant 8: How do you convince management to start collecting data long before they know what the actual use case is?

Click: How do you convince anybody of anything? You've got to convince them there is a potential business case down the road. The data collection costs can be really cheap, and sometimes you can just slide it in there. We piped our logs to an S3 bucket and forgot about them, but three years later, we had a lot of log data.

Participant 9: Sudhir asked the question about who to hire first, but what about the longer term? How should you structure teams? Eric Colson, chief algorithms officer at Stitch Fix, actually talked about generalists versus specialists. Should teams be specialized to do one thing, or should they be generalized to do lots of things?

Fu: In my data science team, at least, I try to hire people with both data science and engineering backgrounds, but I also take strong statisticians or very strong engineers onto the team so we have a mix of talent; then, when we have a project that requires both, they will collaborate. Also, we try to develop their talents along the way: the statistician can learn the engineering part on the job, and the engineer can also learn about statistics and machine learning on the job.

Padmanabha: One of the things that we're trying at omni:us right now, which engineers in the audience may recognize, is this concept of squads, chapters, guilds, and tribes. You have a mix of everybody in a particular team, and that's exactly what happened: it starts rubbing off on each other, where we currently have data scientists at omni:us building their own Docker containers; they know how CI/CD works, they know how Helm comes into play. The data engineers and full-stack engineers know things like what precision and recall are and why we're using a certain type of model. Everybody is striving towards an end delivery. A combination of all these different types of roles in smaller teams working towards a good end delivery goal is quite important to how the teams start to function and grow, and that has started to work really well for us.

Molino: I can add an additional dimension, which is the size of the company. For smaller companies, I would expect mostly a team of generalists with one or two specialists, while bigger companies can afford to have teams of specialists that actually support teams of generalists. It depends on the size of the company.


Recorded at:

Jul 18, 2019
