Shubha Nabar Discusses Einstein, the Machine Learning System in Salesforce

Shubha Nabar is a senior director of data science for Salesforce Einstein. Prior to working for Salesforce, she was a data scientist at LinkedIn and Microsoft. In the podcast she discusses Salesforce Einstein and the problem space it addresses, explores the differences between enterprise and consumer machine learning, and then talks about Optimus Prime, the Scala library used at Salesforce.

Key Takeaways

  • The volume of data and advances in hardware have made it possible to do the calculations needed for machine learning a lot faster.
  • AI is a science of building intelligent software, encompassing many aspects of intelligence that we tend to think of as human.
  • If you can’t measure something, you can’t fix it.
  • You have to think about what you can automate, rather than having a human try to engineer all those features.
  • Get feedback on design.

Show Notes

What is Salesforce Einstein?

  • 01:45 Salesforce is a platform that businesses use to manage all their customer relationships, and Einstein is a machine learning intelligence in this platform making all of the business applications smarter.
  • 02:10 This goes across sales, service, marketing and more.
  • 02:15 It’s also a platform, so you have a community of developers, who build business apps on this platform and make them available on the app exchange.
  • 02:25 You have hundreds of thousands of businesses and millions of people who use Salesforce-powered business apps every day.
  • 02:35 As an example, you can have hospitals using it to manage their patients, sales teams who are using it for their whole process, marketing teams to manage their marketing campaigns.
  • 03:00 They want their business apps to be smart, in the same way that consumer apps like Google Now are.

What are some of the use cases you enable?

  • 03:10 Sales people want to know how likely their deals are to close, or marketing teams want to do predictive segmentation of their services.
  • 03:20 You could have call centers that want faster routing.
  • 03:30 There’s a really long tail of use cases - for example, universities who want to manage their student lifecycles; how likely students are to accept their offers.
  • 03:50 The problem is that for the majority of businesses, unless you’re in the top 1% like Facebook or Google, you don’t generally have the data engineering or data science backgrounds to be able to use it for predictive value.
  • 04:10 Salesforce Einstein allows customers to unlock the predictive value in their own apps.
  • 04:15 It’s so simple that you don’t have to be a Facebook or Google in order to wrangle the data, analyse it and derive value from it.

How do you create a commodity pipeline that can be used interchangeably between universities?

  • 04:50 We have very rich metadata about our customers’ data. We know this is your leads table, and it joins with your accounts table.
  • 05:05 We also know for each object whether a field is a phone number, an address, and so on.
  • 05:15 This allows us to automate other things that data scientists usually have to do manually.
  • 06:30 We can take a configuration that specifies how this object relates to these other objects, and we can use that to generate a machine learning pipeline.
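
The idea of generating a pipeline from metadata can be illustrated with a minimal Python sketch. This is not Salesforce's implementation, which the notes do not describe. The field types, step names and the `plan_pipeline` helper are all hypothetical, chosen only to show how a declared field type can select feature steps without a human choosing them per data set.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    # Hypothetical field types; the real metadata model is not described in the notes.
    EMAIL = "email"
    PHONE = "phone"
    TEXT = "text"
    NUMBER = "number"


# Hypothetical mapping from a field's declared type to default feature steps.
DEFAULT_STEPS = {
    FieldType.EMAIL: ["extract_domain"],
    FieldType.PHONE: ["extract_area_code"],
    FieldType.TEXT: ["top_terms"],
    FieldType.NUMBER: ["fill_missing", "scale"],
}


@dataclass(frozen=True)
class Field:
    name: str
    type: FieldType


def plan_pipeline(fields):
    """Return (field name, step) pairs chosen purely from field metadata."""
    return [(f.name, step) for f in fields for step in DEFAULT_STEPS[f.type]]


# A made-up "leads" object described only by its metadata.
leads = [
    Field("contact_email", FieldType.EMAIL),
    Field("annual_revenue", FieldType.NUMBER),
]
plan = plan_pipeline(leads)
```

Because the plan depends only on the schema, the same function could be applied to a different customer's object with different fields, which is the point of the commodity pipeline.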

How did you build this?

  • 05:50 We built a framework that allows you to specify an outline of the particular problem.
  • 06:00 It then goes in and does the things that a data scientist would usually do, like the feature generation, the feature engineering, the feature selection, the model selection, the data balancing, the recalibration - all automatically based on the metadata.

What does the outline of the business problem mean?

  • 06:30 What we’ve built so far - we have a platform that we’ve exposed to our internal developers, and they have used it to build packaged apps that we know our customers are interested in - lead scoring, opportunity insights and so forth.
  • 06:40 The next thing is to open up the platform so that external developers and less technical personas like admins can build their own custom models.

What’s the roadmap?

  • 07:00 We’ve released APIs that allow you to upload a data set, and then perform sentiment analysis or image recognition.
  • 07:15 We use deep learning to make this happen, to give really accurate models.
  • 07:20 The other step is about making them available in the context of your app.

What’s your definition of Artificial Intelligence?

  • 07:40 AI is a science of building intelligent software, encompassing many aspects of intelligence that we tend to think of as human.
  • 07:50 For example: learning, reasoning, perception, language understanding and so on.
  • 08:00 Machine learning in particular is teaching machines to do particular tasks by learning from past examples without explicitly programming them.

You mentioned you would like to democratise AI technologies - what did you mean by that?

  • 08:20 Making these technologies much more accessible - all these businesses are generating so much data, but very little of it gets used.
  • 08:35 Unless you are in the top 1% of companies today, it’s very hard to have the right level of data scientists to create predictive data models.
  • 08:45 The work we’re doing at Salesforce Einstein is democratising these technologies, making it much easier for businesses to benefit from machine learning.

You mentioned a story about eliminating mundane tasks - what was that?

  • 09:10 My roommate used to do biology experiments in the lab, and she would spend a lot of time counting worms.
  • 09:30 This seemed like the kind of task that would be ripe for disruption with some of the modern image recognition techniques.
  • 09:50 There are a lot of ‘lab-in-the-cloud’ technologies - some startups are starting to think about this kind of thing.

Is there a deeper trend towards AI and ML, or is it just volume of data?

  • 10:30 It’s the volume of data and the advances in hardware that have made it possible to do the calculations needed for machine learning a lot faster.
  • 10:50 The benefit that people are seeing from applying these kinds of technologies is smarter apps.

What other frameworks or toolkits are out there?

  • 11:15 Engineers should start getting fluent in the language of machine learning; Coursera has lectures available by Andrew Ng.
  • 11:30 Becoming more conversant in the terminology is quite useful.

What are typical stages of a data pipeline?

  • 12:00 You start by prepping the data; you might join in data sources, aggregate, throw out outliers or bad data.
  • 12:15 You’d then do feature engineering where you transform the data to generate more data, like quantising ages into buckets.
  • 12:35 You then train a bunch of different models with the data set, evaluate them, and then go back and do the cycle again.
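
The stages above can be sketched in a few lines of Python. This is only an illustration: the records, the plausibility range for ages, the bucket width and the two toy "models" are all made up, and a real pipeline would train its candidates from data rather than hard-code them.

```python
def prep(rows):
    """Data prep: drop records with a missing or implausible age."""
    return [r for r in rows if r.get("age") is not None and 0 <= r["age"] <= 120]


def age_bucket(age, width=10):
    """Feature engineering: quantise an age into a bucket label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def accuracy(model, rows):
    """Evaluation: fraction of rows where the model's prediction matches the label."""
    return sum(model(r) == r["label"] for r in rows) / len(rows)


raw = [
    {"age": 34, "label": 1},
    {"age": 250, "label": 0},   # outlier, removed by prep
    {"age": None, "label": 1},  # bad data, removed by prep
    {"age": 61, "label": 0},
]
clean = prep(raw)
features = [{**r, "age_bucket": age_bucket(r["age"])} for r in clean]

# Two toy candidate "models"; a real pipeline would fit these to training data.
candidates = {
    "always_one": lambda r: 1,
    "under_50": lambda r: 1 if r["age"] < 50 else 0,
}
scores = {name: accuracy(m, features) for name, m in candidates.items()}
best = max(scores, key=scores.get)
```

In practice the evaluation would use held-out data, and the scores would feed back into another round of prep and feature engineering, which is the cycle described above.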

What about bias in data sets?

  • 13:15 There’s a lot of research going on about how to construct fair algorithms; but one of the most basic steps to avoid propagating biases to data that’s going to drive decisions is to measure the biases you want to prevent.
  • 13:30 If you can’t measure something, you can’t fix it - so instrumenting the pipelines and measuring the things you want to protect against is a big step.

How do you identify the things that might be surrogates for biased data?

  • 13:45 People tend to exclude the direct data, but there may be indirect data like zipcode which could be correlated with the biased data.
  • 14:00 You want to include those factors so that you can explicitly measure the bias.
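
As one way to make "measure the bias" concrete, the Python sketch below computes the rate of positive predictions per group and the gap between groups, a simple metric often called the demographic parity difference. The notes do not say which metrics Salesforce uses, so this metric is an assumption, and the records and group labels are hypothetical.

```python
from collections import defaultdict


def positive_rate_by_group(records, group_key):
    """Share of positive predictions within each value of group_key."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        positives[r[group_key]] += r["predicted"]
    return {g: positives[g] / totals[g] for g in totals}


def parity_gap(rates):
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical scored records; "zipcode" stands in for a possible surrogate field.
scored = [
    {"zipcode": "zip_a", "predicted": 1},
    {"zipcode": "zip_a", "predicted": 1},
    {"zipcode": "zip_a", "predicted": 0},
    {"zipcode": "zip_b", "predicted": 0},
    {"zipcode": "zip_b", "predicted": 0},
    {"zipcode": "zip_b", "predicted": 1},
]
rates = positive_rate_by_group(scored, "zipcode")
gap = parity_gap(rates)
```

Keeping the surrogate field in the instrumentation, even if it is excluded from the model, is what makes a gap like this visible at all.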

What’s the difference between enterprise and non-enterprise ML?

  • 14:40 In the consumer space, the cycle described previously has a data scientist building a hand-tuned model with a single data set they understand very well.
  • 15:00 In the enterprise space this goes out the window. You’re building so many different models for so many different businesses and use cases that you can’t have dedicated data scientists.
  • 15:10 You have to think about what you can automate, rather than having a human try to engineer all those features.

What does a data scientist in an enterprise company or consumer company focus on?

  • 15:25 For consumer space, you’re optimising data based on known use cases and things you understand really well.
  • 15:35 In the enterprise space, you’re optimising for many different unknown use cases with data that you may not understand really well.
  • 15:50 There isn’t one model that fits all.
  • 16:00 As an example, if Amazon is trying to predict whether someone is going to make a purchase - I’m intimately familiar with the signals that might go into this prediction.
  • 16:15 I know to ignore signals like credit card swipes, because they are more an effect than a cause.
  • 16:25 When I am intimately familiar with the data I don’t want to include credit card swipes because I know that isn’t correlated with the prediction.
  • 16:30 When you’re dealing with data where there may be automated processes filling out these fields, it is much harder to tell cause from effect.
  • 16:50 At Amazon or Facebook, you’ll have a team of data scientists working on one data set.
  • 17:00 They have a lot of tribal knowledge about the form of the data.
  • 17:05 They know which data to include and exclude from the models.
  • 17:10 There’s a lot of manual effort there that you can’t do in an enterprise space, because every data set is different.
  • 17:15 Even if you look at a single object - such as a lead, and you want to predict lead conversions - this customer might have a bunch of custom fields which are different from other customer types.

So how are you able to do this with your data?

  • 18:05 With Optimus Prime we say that a customer’s fields correspond to these types, and it automatically does something appropriate with fields of that type.
  • 18:15 We automate things that normally a data scientist would have to do.

Can you give some examples?

  • 18:35 Feature engineering is a great example - there are a bunch of tricks that people try; if you know that a field is an e-mail address, you can extract the domain and then correlate that with other well known domains.
  • 18:50 If you have a large piece of text then you can automatically extract the top ten terms.
  • 19:00 If you have a phone number then you can automatically check the area code.
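
The three tricks above are easy to sketch in Python. These helpers are illustrative only and are not Optimus Prime code: the function names are hypothetical, the tokeniser is deliberately naive, the e-mail helper assumes a well-formed address, and the area-code helper assumes 10-digit North American numbers.

```python
import re
from collections import Counter


def email_domain(email):
    """Return the lower-cased domain of a well-formed e-mail address."""
    return email.rsplit("@", 1)[-1].lower()


def top_terms(text, k=10):
    """Return the k most frequent lower-cased word tokens in a piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [term for term, _ in Counter(tokens).most_common(k)]


def area_code(phone):
    """Return the area code, assuming a 10-digit North American number."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits[:3] if len(digits) == 10 else None
```

Each helper needs to know only the field's type, not the customer's business, which is what lets this kind of feature engineering run automatically across very different data sets.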

What are some of the challenges the teams face?

  • 19:45 One of the challenges is that even a single predictive use case is non-trivial to solve.
  • 20:00 Take for example customer churn; every business wants to be able to predict customer churn.
  • 20:05 There could be many sorts of data sources that could be relevant to those predictions.
  • 20:10 There could be business processes that are populating those data sources; and even if you have common data sources, they could have very different data shape.
  • 20:20 Customer churn may be really rare for one business and really common for another business.
  • 20:30 What it means is that even if you’re solving one thing, you’re building thousands of personalised per-customer models for just a single use case.
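
To see why one shared setting does not work when churn is rare for one business and common for another, the Python sketch below computes per-customer class weights with the common "balanced" heuristic, n_samples / (n_classes * class_count). The notes mention automated data balancing but not how it is done, so the heuristic is an assumption, and the customers and label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Hypothetical churn labels (1 = churned) for two different businesses.
labels_by_customer = {
    "rare_churn_co": [1] * 2 + [0] * 98,
    "common_churn_co": [1] * 40 + [0] * 60,
}
weights = {
    name: balanced_class_weights(y) for name, y in labels_by_customer.items()
}
```

Here the churn class gets a weight of 25.0 for the first business and 1.25 for the second, so even this one preprocessing choice has to be derived per customer.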

What is Optimus Prime?

  • 21:00 It’s a machine learning framework that we built for building modular views and strongly typed machine learning workflows with minimal hand tuning.
  • 21:05 It’s essentially an automated machine learning framework that we use to solve some of these problems.

What’s it written in?

  • 21:20 It’s written in Scala, built on top of Spark ML, because it’s great for working with large data.
  • 21:30 We loved some of the Spark ML abstractions that they have built on machine learning pipelines.
  • 21:35 We are looking to decouple from Spark for cases with small data, where the overhead of distributed computing is too expensive.

What makes it special?

  • 21:50 The things in the pipeline are strongly typed, and we do a lot of stuff with those types.
  • 22:00 The types drive a lot of the automated feature selection and model selection.
  • 22:05 Those are some of the machine learning automation things you get out of it.
  • 22:15 The type safety is also good for developer productivity.
  • 22:50 It’s an internal tool; it’s not open-source.

What are some of the things you have learned?

  • 23:00 Don’t try to reinvent the wheel - the first version of Optimus Prime was built when Spark ML pipelines were in their infancy.
  • 23:15 We came up with our own API. Try to leverage what’s happening in the open source community; in the second version we did.
  • 23:35 Another thing would be to get feedback on design. Any time you want to implement a new feature, go through a design review.

What are your plans for the QCon SF AI track?

  • 24:25 For most people their initial exposure to machine learning is through Kaggle competitions on a CSV data set, where they try out something on that.
  • 24:35 In real life, it’s a lot more challenging than that - you have to handle ETL, schema evolution, taking models from development to production - some of the companies in the world have solved these problems.
  • 25:00 I also wanted to see how new technologies like deep learning are being used in production today.


Salesforce Einstein
Andrew Ng's Course on Coursera
Article on Spark ML
