
Building a Voice Assistant for Enterprise



Manju Vijayakumar talks about Einstein Assistant - an AI Voice assistant for enterprises that enables users to "Talk to Salesforce". She goes through the high-level architecture and workflow starting from Automatic Speech Recognition (ASR) on device to using NLP for identifying entities and intents in a single dialog conversation text.


Manju Vijayakumar works as a Lead Software Engineer at Salesforce.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Today's topic is going to be all about building voice assistants for enterprise. Voice assistant is something new; this is the first iteration of a product that Salesforce has introduced.

A little bit about myself. I am Manju Vijayakumar. I'm a lead software engineer at Salesforce, working on the Einstein Voice Assistant team that's responsible for this voice assistant. So basically with voice assistant, our users are able to literally talk to Salesforce, and just use natural language to make conversational updates to Salesforce. I am in a team of engineers and product managers that works very closely with the Salesforce AI research group. So we are constantly collaborating with them, bringing the latest research in NLP and voice to production, and applying it to solve real customer problems.

So, a little bit about the agenda today. We're going to talk about why voice: what's the motivation for using voice? That's followed by a quick demo of the Einstein Voice system itself, and then we will dive deep into the conversational AI ecosystem and the natural language understanding services that power it, followed by a wrap-up with the challenges and lessons we have learned building and deploying this, and also a little bit about future considerations and what's next in exciting research in NLP and AI.

This is built for Salesforce users. So a lot of our enterprise users use Salesforce for customer relationship management. But we also develop this in a platform-first way, so you could build custom apps on top of it. So you don't necessarily have to stick with Salesforce; you could make it work with several data systems.

Why Voice?

So before we begin, I have a question for you. How many of you are familiar with the American TV show "Friends" from the 1990s? Okay, most of you. I'm a big fan of it. So imagine my surprise when I found this scene, the opening scene from an episode in season three. These are two characters, Pete and Monica, and Pete is telling Monica all about voice recognition. He goes something like this: "Voice recognition is going to be pretty much standard on any computer you buy. So you can be like, 'Wash my car. Clean my room.'" Now, this is back in 1995, and it was so futuristic at the time, right? Fast-forward 20-plus years, and here we are, pretty much using voice for familiar things, like asking Alexa to look up the weather, or play music, for instance. It's pretty much ubiquitous.

And that should not come as a surprise, because we see technology trends changing every couple of decades, and they have essentially changed the way users interact with computers and devices. So we started all the way from command line interfaces and went to GUIs. Then in 2007, the iPhone really changed things: with touch interfaces, user interactions became even more natural. And now we have arrived at voice, with more and more voice devices helping you do that. But if you look at how voice technology is used these days, it's pretty much very consumer-centric. It's for very little things, like setting up a reminder or creating a shopping list. It seems pretty straightforward. But we asked a question: what if you could utilize this for a B2B case, so basically business-to-business?

A lot of business users use apps which are really painful. And they use them because they have to get a lot of work done in their everyday lives. And this is not just simple read-only access: you have to make constant updates to legacy data systems and things like that. So at Salesforce we asked ourselves a question: what if using a business app was as natural as having a conversation? And that became the primary motivator for our vision. So our vision is basically to deliver an intelligent assistant that leverages the state-of-the-art voice and natural language understanding capabilities that are available today to really understand and support our users in accomplishing their goals.

Now, this is a lofty and an ambitious goal. But we're getting there. This is our first iteration of voice assistant. And we're going to dive in now with a recording of a demo of the voice assistant. But before that, I want to set the stage for this. Who is a Salesforce user? And I want to introduce a use case before we go into the demo.

Demo of Einstein Voice Assistant

So here is Amy. She is a busy salesperson, in and out of meetings every day. She just got out of a big customer meeting where she closed a deal with Acme Corporation, and she's very excited about the deal. As soon as she gets out of the meeting, she has just enough time to log her meeting notes and then move on to her next meeting. So typically, a salesperson like Amy sits down at the end of the day, and then she has to glean through all of her meeting notes and make a lot of updates to a system like Salesforce, where she has to enter all of that unstructured data from a meeting note, for example, into structured entities like systems of record and customer accounts and things like that. So basically, Amy wants to update Salesforce. Let's see how voice assistant can help Amy do this.

So she pulls up her Einstein app on her mobile phone. It should come up in a minute. "Met with Chris Hopkins from Acme Corporation. We had a great meeting and closed a deal for purchasing merchandise. Follow up with Chris next week. Change the deal amount to one million dollars. Set the close date to November 15th." Okay. So she's entered all of her notes, and they've been transcribed from speech to text. Let's see what happens when she hits the analyze button.

So the request goes on to Einstein. Einstein is figuring out everything it needs to interpret the unstructured data, and it's extracted the key entities, like Acme Corporation, for example, and it surfaces a record from Salesforce to Amy. Amy confirms that this is the record she's talking about, and she moves on to the next screen. Now Einstein also finds that Chris Hopkins is a contact relevant to Acme and surfaces that. What happens next is really interesting, and my favorite part. Basically, Einstein is figuring out: what does Amy want to do? So, for example, when she says, "Follow up with Chris next week," Einstein understands that it has to create a task or a reminder so she can follow up with Chris next week. Once all of that is confirmed by Amy, Amy goes ahead and saves it to Salesforce. So literally, in just three taps and just using her voice, Amy is now able to update Salesforce. How incredible is that?

So for business users, this is a big deal. Amy is really happy now and super pleased with herself. And why is that? The key and critical piece of this is converting all of that unstructured data into structured data. And that's where the biggest value-add for business users is. It isn't just about keeping Amy productive: Amy doesn't have to know anything about the Salesforce CRM system. She doesn't have to learn what the structured entities are, what an account is, what a record in Salesforce is. All she needs to do is log her meeting notes, and Einstein is going to convert all of them into whatever makes sense for your structured entities in Salesforce.

Now, apart from that, there's the accuracy and timeliness of data capture. Usually, sales folks, or business users generally, do not log their meeting notes in the middle of the day, and obviously making updates in Salesforce is going to take a while. So Amy would actually push it off to the end of the day and make these changes then. But it's super important in the business world for all of these data insights to flow into the CRM in time, and to be visible to her team. So as soon as she's closed the deal, her team knows that a deal has been closed, and they have to do a lot of follow-ups with all the stakeholders. So it's super important that it's visible to the team and they get a team notification as soon as the deal is closed.

So, all of that is good. But we wanted to dive deep into the technical details that power this. Let's look at the building blocks of voice assistant. The bottom layer is basically ASR, which is Automatic Speech Recognition. Here, we're using ASR on-device: we're using the Apple and Google voice APIs, so no big deal. That's the easiest part. The most critical part is the next layer, which is natural language understanding. Natural language understanding is doing a lot of things behind the scenes. Our conversational service looks at all of your unstructured data and tries to interpret that text, understand the context that you're speaking about, and not only that, it also tries to build a complex relationship graph behind the scenes.

So in our example, we saw Acme Corporation is a company. Then there's Chris, who is a relevant contact to Acme Corporation. And then there are records that are related to that that need to be updated. At the end of it, you need to integrate it to your systems. So that is the CRM integration part. So essentially, it’s not just interpreting and translating all of your unstructured text to structured text, but it's also helping users to be very productive and getting all of their insights into the system at the right time. Those are the building blocks.

Conversational AI Ecosystem

So now let's look at what the ecosystem is. In Salesforce, we build everything platform-first. At the bottom layer is what we call the Einstein platform, which holds all of your automatic speech recognition models, natural language understanding models, and then system-specific metadata and databases. On top of that is what we call the conversational API, which exposes all of your natural language understanding services. We'll dive deep in the rest of the talk into each of the elements in this layer. And on top of this API is what we build, what we call conversational apps. What you just saw is one instance of a conversational app: Einstein Voice Assistant is a conversational app that was built on this services layer, but there are also several other apps that are publicly available. Right now there's also Einstein Voice Bots, so customers can build voice bots for their businesses. We also have smart speaker integration, so you can ask Alexa or Google Assistant, "Hey, pull my daily briefing for today from Salesforce," for instance.

We also have another cool app coming up, which is voice navigation. So you can use voice navigation to navigate through your apps and your analytics dashboard. You could simply say things like, "Hey, pull up my dashboard today for my top 10 accounts," for instance. This is the entire ecosystem that powers it.

Now, let's dive deep into the conversational API layer, which serves all your natural language understanding needs. The first step in translating all of that unstructured text to structured text is something called named entity recognition. If you're familiar with natural language processing tasks, named entity recognition is the kind of NLP task where you are extracting information about entities.

So let's look at an example to follow it better. Here is a piece of text. What the NER does is take this piece of text, break it down into tokens, and determine whether each token can be assigned to a predefined category. So, for example, when it looks at a phrase like "The committee of," it cannot recognize it as an entity, so it puts it in a bucket called "other"; "O" stands for "other." And this is a standard annotation format called the CoNLL format in natural language processing. When it looks at a name like McLeese, it understands it's a person entity, so it classifies McLeese as a person.

Now, there's something more interesting happening here. It's not just a single token but also a sequence of tokens that can get assigned to a predefined category. So, for example, here, "the end of this month" is actually a date, but it doesn't say in a straightforward way that this is a date. If it were "July 1st," for example, the model could easily tell that July 1st is a date entity. But here the sequence model has to learn that "the end of this month" is also a date. So this way, NER is like a neural network that learns behind the scenes how to classify each of these tokens into predefined categories.
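To make the tagging concrete, here is a minimal Python sketch of CoNLL-style per-token labeling. The sentence is stitched together from the fragments quoted above, and the span annotations are hand-written for illustration; this is not the output of the actual NER7 model.

```python
# Minimal sketch of CoNLL-style token tagging. Each token gets either
# an entity label or "O" for "other" (outside any entity).
def tag_tokens(tokens, entity_spans):
    """entity_spans maps (start, end) token-index ranges to labels."""
    tags = ["O"] * len(tokens)
    for (start, end), label in entity_spans.items():
        for i in range(start, end):
            tags[i] = label
    return list(zip(tokens, tags))

tokens = ["The", "committee", "of", "McLeese", "will", "meet",
          "at", "the", "end", "of", "this", "month"]
# Hand-annotated spans for illustration: "McLeese" is a PERSON, and
# the multi-token phrase "the end of this month" is a DATE.
spans = {(3, 4): "PERSON", (7, 12): "DATE"}
tagged = tag_tokens(tokens, spans)
```

A real sequence model predicts these tags itself, of course; the point here is just the output format: one label per token, with "O" for anything outside an entity.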

So the model that we have built is called NER7, because it recognizes seven entities: person, organization, location, and numerical data like date, time, money, and percentage. The model also has an "other" class if it cannot classify a token into any of these seven entities. Behind the scenes, we have trained the NER7 model on a variety of data sets. There is a big open source corpus called MUC-6 and MUC-7, which is pretty standard in the NER world. We have also mixed that with custom data sets based on information gleaned from historical notes of customer sales meetings that customers have provided us. Based on this data, we were able to train the neural network to pretty much 90% accuracy, and it's working really, really well for these seven entities.

So fundamentally, we are asking the question: what are the entities in the text? Going back to the example where Amy dictated her meeting note, it identifies Chris Hopkins as a person, Acme as an organization, July 1st as a date, and $250,000 as currency or money. And then when it sees a phrase like "two weeks", it classifies that as a date, but also does one additional thing, which is normalization. By normalization, I mean it takes a phrase like "two weeks", understands it's a date, and then converts it into a consistent date format, so that it is always formatted in the same way. This would be a date exactly two weeks from the day that the note was dictated.
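The normalization step can be sketched like this: a relative phrase such as "two weeks" is resolved against the dictation date into a consistent ISO format. The phrase table and function names here are illustrative assumptions, not the production normalizer.

```python
from datetime import date, timedelta

# Toy lookup of relative date phrases; a real normalizer would parse
# arbitrary expressions, not just fixed strings.
RELATIVE_PHRASES = {
    "two weeks": timedelta(weeks=2),
    "next week": timedelta(weeks=1),
    "tomorrow": timedelta(days=1),
}

def normalize_date(phrase, dictated_on):
    """Resolve a relative phrase against the day the note was dictated."""
    delta = RELATIVE_PHRASES.get(phrase.lower().strip())
    if delta is None:
        return None  # would fall back to an absolute-date parser
    return (dictated_on + delta).isoformat()

result = normalize_date("two weeks", date(2019, 7, 1))  # "2019-07-15"
```

The key design point is anchoring: the same phrase "two weeks" normalizes to a different concrete date depending on when the note was dictated.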

So having done named entity recognition, we move on to something specific to your systems. Once you have the entities figured out, you want to understand what an entity means: you have an organization, you have a person, but what do they mean in your system? So the question we are fundamentally asking is: is this entity in my CRM, in my system?

So going back to the example. Once we have found Acme Corporation, we do a search in the Salesforce database, or whatever data system you integrate with. And this is not a blind search. Typically, Salesforce has hundreds of thousands of customers with different customer schemas, so a blind search would be really, really bad. So we apply heuristics like most recently used and frequently used, and we also allow users to do some setup beforehand, where they can say what kinds of records are relevant to their customer meetings, for example. This happens through an admin interface behind the scenes. But the gist of it is that we have a lot of heuristics that can be applied instead of doing a blind search. Once that search succeeds and there are records matched for Acme, these records are then sent back to users to disambiguate.

It often happens that you cannot find a single exact match for Acme, so you often return the top N results to the user. But using those heuristics, we make sure that they're ranked and relevant to the user. So Amy now looks at several records that are shown to her and picks the one that she thinks is closest to the deal. For example, here, the algorithm returns Acme Corporation, but it also shows the relevant deals for that company. And Amy is able to choose: "Okay. Here is Acme, 5,000 widgets, and that's my record. That's the deal I'm closing today."
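A toy version of this heuristic ranking might look like the following. The record fields and weights are made up for illustration; the point is that candidates are scored by signals like recency and usage frequency rather than returned from a blind search.

```python
# Illustrative heuristic ranking of candidate records before showing
# the top N to the user for disambiguation. Field names and weights
# are assumptions, not the production scoring function.
def rank_candidates(records, top_n=3):
    def score(rec):
        # Recently-used records are weighted more heavily than raw
        # usage counts; real systems would combine many more signals.
        return 2.0 * rec["recently_used"] + 1.0 * rec["use_count"]
    return sorted(records, key=score, reverse=True)[:top_n]

records = [
    {"name": "Acme Corporation - 5,000 widgets", "recently_used": 1, "use_count": 12},
    {"name": "Acme Corp (stale duplicate)", "recently_used": 0, "use_count": 1},
    {"name": "Acme Corporation - renewal", "recently_used": 1, "use_count": 3},
]
top = rank_candidates(records, top_n=2)
```

With these weights, the recently used, frequently used record ranks first and the stale duplicate never reaches the user.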

So once entity resolution is done, as we keep going through the stages, it's very important to maintain some kind of state. This is something that we call context management. We ask ourselves the question: what data have we seen so far? In our example, now that we know Acme Corporation has been extracted as an entity, and we have also found the matching record for Acme, we save the context with the specific record ID for that account. On the next screen, when Amy is selecting the person, Einstein surfaces Chris Hopkins as a person relevant to the deal. Einstein is asking a question: did we see the organization in this context? It looks at Chris Hopkins, but then it's trying to build a relationship graph here. It's trying to understand: how is Chris Hopkins related to Acme Corporation? And now that we have Acme Corporation already resolved and in our context, it understands that it needs to search for Chris Hopkins, but within the scope of Acme Corporation. And that's how it surfaces a truly relevant contact for Acme Corporation.
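Here is a minimal sketch of that context-management idea: once the account is resolved, its record ID is kept in the conversation state, and a later contact search is scoped to it. The class and method names are assumptions for illustration.

```python
# Minimal conversation-state sketch: resolved entities are remembered
# so later searches can be scoped by the relationship graph.
class ConversationContext:
    def __init__(self):
        self.resolved = {}  # entity type -> resolved record ID

    def resolve(self, entity_type, record_id):
        self.resolved[entity_type] = record_id

    def scope_for(self, entity_type):
        # Illustrative rule: a contact search is scoped to the
        # already-resolved account, if there is one.
        if entity_type == "contact":
            return self.resolved.get("account")
        return None

ctx = ConversationContext()
ctx.resolve("account", "001-ACME")        # Acme's record is confirmed
scope = ctx.scope_for("contact")           # later contact search limited to Acme
```

So when the assistant later looks for "Chris Hopkins", the search runs within the scope of the Acme account rather than across the whole database.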

Now, this is the final part that we saw on the last screen when we were going through the demo, which is text classification. This is the most critical piece, apart from building the relationship graph, because you want to know what Amy was intending to do: what was her intention, and what are the actions she wants to take based on those intents?

So let's look at the example of the line "Follow up call with Chris in two weeks." Now, this is very intuitive for you and me as humans. That's pretty straightforward: "Follow up call with Chris in two weeks." You might have a task or a reminder that you would set up manually. But for the algorithm to understand what the intention is and what action needs to be taken, there has to be a model behind the scenes that can do that. So basically, we send a prediction request to our conversational language APIs. This API uses a pre-trained model: we have models that have been trained on lots of data sets of these kinds of sentences, so they can tell what kind of intent Amy has. The language API sends back a JSON payload, which shows you the probabilities of the different intentions. So this intent model took that whole sentence, understood that the intention could be "create" or "update record," and sent back what it thinks the probability of each of those actions is. Here, CREATE comes back with a 99% probability, so Einstein understands, "Okay, I need to create a record for this."
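Consuming a response like that might look as follows. The JSON shape, with a list of intents and their probabilities, follows the description above, but the exact field names are assumptions, not the real conversational API contract.

```python
import json

# Hypothetical intent-classification payload, shaped like the response
# described in the talk: each candidate intent with a probability.
payload = json.loads("""
{"intents": [
    {"label": "CREATE", "probability": 0.99},
    {"label": "UPDATE_RECORD", "probability": 0.01}
]}
""")

def top_intent(payload, threshold=0.5):
    """Pick the most probable intent, or None if nothing is confident."""
    best = max(payload["intents"], key=lambda i: i["probability"])
    return best["label"] if best["probability"] >= threshold else None

intent = top_intent(payload)  # "CREATE"
```

The threshold matters in practice: when no intent is confident enough, the assistant should ask the user rather than guess.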

So now that we have understood the intention, there is a final piece missing, which is slot filling. In the NLP world, or the chatbots world, if you've heard of it, slot filling is basically this: once you understand what action you want to take, you have to fill in certain parameters for that action. By that I mean, for each action item, you want to figure out what values you need to plug in. So yes, you want to update the record, for instance, but you can only update the record if you have the date parameter and the amount parameter filled in. So it figures out the date of July 1st and the $250,000 amount. And when it creates a task, it sends another request to NER on the backend to figure out Chris as a person entity and two weeks as a date entity, and then fills in those slots for the create task action. And again, here, the date is always normalized: when you say "in two weeks," it converts it into a consistent format, an actual date two weeks from now.
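Slot filling can be sketched as mapping each required parameter of an action to an entity type and checking completeness. The slot names and the entity mapping here are illustrative assumptions, not the assistant's actual schema.

```python
# Required slots per action, and which NER entity type fills each slot.
# Both tables are made up for illustration.
REQUIRED_SLOTS = {"create_task": ["who", "when"]}
SLOT_TO_ENTITY = {"who": "PERSON", "when": "DATE"}

def fill_slots(action, entities):
    """Fill an action's slots from NER output; report whether all are set."""
    slots = {s: entities.get(SLOT_TO_ENTITY[s]) for s in REQUIRED_SLOTS[action]}
    ready = all(v is not None for v in slots.values())
    return slots, ready

# NER output for "Follow up call with Chris in two weeks",
# with the date already normalized as described above.
entities = {"PERSON": "Chris", "DATE": "2019-07-15"}
slots, ready = fill_slots("create_task", entities)
```

If a required slot is still empty, the action cannot execute yet; a dialog system would typically prompt the user for the missing value.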


So that wraps up the technical details of the natural language understanding. But as with any AI project, there are loads and loads of challenges and lessons learned. The biggest is obviously data challenges. At Salesforce, we have this unique, complex problem: we are a heterogeneous database. Basically that means that, across a scale of 150,000 customers or so, each customer is able to create their own custom schemas.

By that I mean, let's take a very clear example of what that means. If you had to create an account record in Salesforce, the pretty much standard way is that you have an account ID, name, phone, email, and things like that. But if you were, say, a banking company using Salesforce, you might have a custom schema which has maybe a checking account number or a savings account number or whatever. So accounts might mean different things to different people. If you think about medical professionals, they might have different things in their account, like a medical record number, for example. So customers are constantly customizing their schemas, all the time. This presents a big problem for something as generalized as AI and NLP, because customers are defining custom schemas, and those schemas are not consistent. They're not stable; they're changing all the time.

So a big question to ask is how do you make AI and NLP and all these algorithms work for every customer schema? We'll talk about solutions a little later on, but this is one of the biggest challenges of working with enterprise data.

Another big challenge is this: we saw that a lot of records are returned for Acme and we apply a lot of heuristics, but which Acme Corp did you actually mean? By this I mean that in enterprise data, there are a lot of duplicates. If you actually look into any business data, intentionally or unintentionally there are records that are duplicates of each other. So there could actually be 100 Acme records, but maybe only 10 make sense. And this is always the state of enterprise data: lots of duplicates, and the problem of identifying the most relevant Acme, in our case. There's also one big side effect of this: it creates a very frustrating user experience. You don't want to show Amy 50 records and ask her to go through every record and pick which one she actually meant. That totally defeats the purpose of using a voice assistant and making her life easy.

Now the next biggest part is automatic speech recognition. We are all very familiar with this; we keep asking, "Alexa, I just said that. Why don't you understand?" So this is an ongoing thing. But when it comes to a domain-specific use case like ours, we have several more challenges with automatic speech recognition. For example, what do we do with domain-specific jargon? Take the example of a medical professional using voice assistant and taking notes after meeting with her patient today; maybe she utters words like "ER" or "MRI" or the names of instruments that ASR understands nothing about. So this could end up with misspelled names, and then downstream, it's all gone: there is no way that you can classify anything like that. The first entry point is automatic speech recognition, and we have a big, big problem with domain-specific jargon.

So one of the reasons we are actually moving towards building automatic speech recognition in-house at Salesforce is that we want to provide a way for customers to customize their models and provide their own custom jargon as dictionaries, for example. And then there is the whole question of the audio environment. We're not living in a perfect world, and ASR doesn't catch your words sometimes. It could be because of noisy settings, and this happens a lot in business settings: one of our customers is a big construction company, and they're always in the field. There's always a lot of noise happening, and you don't expect them to go to a really quiet room to record their meeting notes. They have to record them right then and there. So we are also building optimized models: we simulate such acoustic conditions and introduce noise into our dataset to make sure that those words are effectively caught by automatic speech recognition.
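The noise-augmentation idea can be illustrated with a toy sketch that mixes random jitter into clean audio samples. Real pipelines mix in recorded acoustic environments (construction sites, offices) rather than uniform noise; everything here is a simplified assumption.

```python
import random

# Toy noise augmentation: perturb clean training samples so the model
# also sees noisy conditions. Real ASR augmentation mixes in recorded
# background audio; this just adds bounded random jitter.
def add_noise(samples, noise_level=0.05, seed=7):
    rng = random.Random(seed)
    return [s + rng.uniform(-noise_level, noise_level) for s in samples]

clean = [0.0, 0.5, -0.5, 0.25]
noisy = add_noise(clean)
```

Training on both the clean and the augmented copies is what lets the model stay robust when the real input is recorded in a noisy setting.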

And then there's this big hot topic about bias and ethics in AI. We saw this in our data sets as well: most of our datasets were skewed towards the western U.S. population. So a male speaker from the U.S. would speak and it works perfectly fine, but then somebody else comes along with a southern dialect and things go haywire. You want to have a really good distribution of data that works across linguistic profiles. Imagine if you're a non-native speaker and English is not your first language; automatic speech recognition is in no way going to be perfect for that person. So this is a big ethics issue, and we have to constantly think about how to get away from biased data, because a lot of open source and licensed datasets still only work really, really well for the U.S. population, but not outside of it.

Now, coming to named entity recognition. That's not perfect either. Named entity recognition actually sounds really intuitive and easy for humans, but it's so, so hard for machines to even figure out what entities you are talking about. Here we have a few examples where named entity recognition completely flopped. One such instance is a sentence that reads, "Today, JP Morgan and I spoke about …" When you look at the sentence, you and I can understand that JP Morgan is most probably a person that you met. But NER kind of gets confused. It's like, "Oh, I thought JP Morgan was a company, because that's probably somewhere in my dataset." There's this ambiguity of whether it is a company or a person, and the probabilities are pretty similar, because the sequence model has learned that in a sentence like "X and I spoke about," you're talking about persons. But at the same time, if it looks at an entity that resembles an organization name, it also comes back with a probability of maybe 50% for that, and then you don't know which one to choose.

This next example is one where the sentence reads "... the san juan center is led by a team of scientists." Now, you and I often don't use capitalized letters for locations. But NER has been trained on datasets where it has always seen location entities that start with capital letters. So it sometimes cannot identify San Juan as a location, because of this case sensitivity. When we realized this, we had to augment our data sets to intentionally introduce this noise, because when we first trained this, it was on a huge corpus of Wall Street Journal articles, and press articles are typically very formatted and structured: all of the locations had their first letters capitalized. But when you and I talk, or when you're entering text, we don't really use capitalized letters all the time.
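The augmentation fix described above, lowercasing a fraction of training sentences so the model cannot rely on capitalization alone, can be sketched like this (the fraction and seeding are illustrative):

```python
import random

# Case augmentation: append lowercased copies of a random fraction of
# training sentences so the NER model stops depending on capitalization
# to spot entities like "San Juan".
def augment_casing(sentences, fraction=0.3, seed=42):
    rng = random.Random(seed)
    augmented = list(sentences)
    for s in sentences:
        if rng.random() < fraction:
            augmented.append(s.lower())
    return augmented

data = ["The San Juan center is led by a team of scientists."]
augmented = augment_casing(data, fraction=1.0)  # always augment, for the demo
```

The original, correctly cased sentences are kept too, so the model still learns from well-formatted text like press articles.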

Now, this was very interesting. I used the voice assistant to record my own name. So I said, "Manju and I met today at Starbucks." And the ASR said, "Oh, okay, your name sounds like 'man joy'. I'm going to just split it and make it two words, 'man joy'." So this is a spelling error. It's obviously not an NLP issue, but it creates an issue for your NLP downstream, because there is no way it can catch misspelled proper nouns. Misspelled proper nouns are the hardest to catch for natural language understanding. There is no way it can figure out the structure or understand that that was a name.

Again, going back to the ASR models that we are building in-house, these are some of the things that we would allow our customers to customize. If it is a Japanese company, for example, and they're using a lot of Japanese names, we would allow them to customize their models with a custom dataset where there are a lot of Japanese entities in the data set, for example.

Future Considerations

Let's look at some of the future considerations. One is building end-to-end optimized models. Basically that means allowing our customers, with their many different custom schemas, to have their own dictionaries, so they are able to configure training data sets that work really, really well for their verticals. Right now we use just a single non-customer-specific model, which may not generalize that well across all verticals. We are also working on models that can normalize your voice inputs. Like I said, when you're in a noisy setting, ASR is not able to catch your words really well. So we have introduced specific acoustic features into our models to create and simulate a noisy environment.

Now, one crucial piece that we are actually missing in this iteration is feedback. When you're working on NLP algorithms, the second step, after you devise a really good algorithm that works well with great neural networks, is to actually collect feedback. By feedback I mean: did this prediction work for you or not? So, for example, say the system misclassified Chris Hopkins as not the relevant contact on Acme for Amy. Amy should be able to tell the system that this was not the Chris Hopkins she was searching for. And the way we could actually pick that signal up is through a fallback mechanism in our voice assistant, where Amy can end up doing a manual search: if she is not happy with the records that we returned, she can press a button and manually search for Chris Hopkins and get the record back. That's not a great user experience, but when she does that, we get a signal from the user saying that whatever records we surfaced to Amy actually didn't work out, because she didn't choose any of those records; she went and searched manually instead.

Now, once we have that feedback, what do we do with it? We have to feed it back as input into our neural networks and retrain the models, so they understand this better and become more accurate. Each time the user uses the system, it should get better and better, and it needs all of these signals to do that. That is not in our first iteration right now, but it's something that we are building.
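One simple way to capture that signal, sketched here under assumed field and function names, is to log each disambiguation as a feedback event: a manual-search fallback marks the surfaced candidates as a negative example for later retraining.

```python
# Toy feedback capture: each disambiguation is logged. If the user
# bypassed our suggestions with a manual search, the surfaced records
# become a negative training example. Field names are illustrative.
feedback_log = []

def record_disambiguation(surfaced_ids, chosen_id, manual_search_used):
    feedback_log.append({
        "surfaced": surfaced_ids,
        "chosen": chosen_id,
        # a manual search means none of the surfaced records were right
        "negative_example": manual_search_used,
    })

# Amy ignored both suggestions and manually found a different record.
record_disambiguation(["003-CH-1", "003-CH-2"], "003-CH-9",
                      manual_search_used=True)
event = feedback_log[-1]
```

A retraining job could then periodically consume this log, pairing each query with the record the user actually chose.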

And the last, and maybe most important, thing that people don't think about so much is the shift that's happening to voice. We believe that with voice, we are where mobile was maybe five or six years ago, when there was a huge shift, with a lot of companies thinking, "Oh, what is my mobile-first strategy?" They had to really reformat and revamp all their platforms and applications to work with mobile; traditionally desktop apps had to become mobile-ready. And we think voice is at the same inflection point. It's a great opportunity for us to understand how you would design for users to use voice, and then guide them to that user experience. We think voice is going to become the next user interface for folks. But at the same time, it's a big design challenge.

And one big thing with voice is that you're not always going to be using a mobile app for voice everywhere, although our first iteration is a mobile app. You will be in different settings all the time. If you follow a business user throughout their day, they're sitting in meetings in a conference room, where you have smart speakers, and then she's out of the room, on the mobile app, always on the move. You want to provide a universal voice interface which syncs all of the data and works in a very frictionless way. It cannot be tied down to, say, one device. Today we work with just Alexa or Google Home, but it has to work seamlessly, universally. So this is something that we are thinking about long-term, and every company needs to think about a long-term strategy for voice and a guided user experience.

What’s Next for NLP and AI?

Now we come to a really exciting part. In NLP and AI research, we have seen these milestones happen over the years. We started with machine learning with handcrafted features: you would set up rules and figure out the features yourself. Most of the work was on the human side; humans figured out the features and then set up systems to learn from them. We evolved from machine learning with feature engineering towards deep learning, where the models learn those features by themselves and humans aren't handcrafting as much. And then we evolved further, to deep learning neural network architectures that are really good at doing single tasks.

So for example, today, if you want an algorithm for question answering, there are fantastic algorithms that work great for that task. But what if you want to do both question answering and sentiment analysis? You would need two models and would have to swap them for each task. This would be the case even for the assistant, because we started out with the assistant just interpreting text, doing named entity recognition (NER) and intent detection. But tomorrow we also want to summarize notes, for example, and we want the assistant to become really good at summarization. So what would we do? Would we swap models for each task? This is a question we often ask ourselves.
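The "one model per task" situation described above can be pictured as a dispatch table of task-specific models. This is a hypothetical sketch, not Salesforce code; both "models" are trivial stand-ins.

```python
# Hypothetical sketch of the "one model per task" setup: each model only
# knows its own task, so the application has to pick (and, in practice,
# load and swap) a different model whenever the task changes.

def question_answering_model(text, question):
    # Stand-in: a real QA model would read `text` and answer `question`.
    return "some answer extracted from the text"

def sentiment_model(text):
    # Stand-in: a real classifier would score the text properly.
    return "positive" if "great" in text else "negative"

# This table has to grow with every new capability:
# NER, intent detection, summarization, ...
MODELS = {
    "qa": question_answering_model,
    "sentiment": sentiment_model,
}

def handle(task, *args):
    return MODELS[task](*args)

print(handle("sentiment", "The demo was great"))              # -> "positive"
print(handle("qa", "Pete talked to Monica", "Who talked?"))
```

Every new task means another entry in the table and another model to maintain, which is exactly the cost the single-model approach below tries to avoid.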

Luckily, Salesforce AI Research was looking into this same big problem in NLP, and they have published an exciting new paper called the Natural Language Decathlon. We believe the next frontier of NLP is building a single model that works really well across multiple tasks. By Natural Language Decathlon, we mean doing 10 different NLP tasks, like question answering, machine translation, summarization, and so on, with really high accuracy. Now, that's a very big problem to attack, but this is the start of the research, and it's something we're really excited about. If you want to learn more, the paper and the implementations are open; you'll find all of the research there, and all of our code is on GitHub.

So going back to that opening scene, Pete and Monica are talking about voice recognition. It turns out that wasn't the complete scene; there was still some of it remaining. Pete says, "It's not going to be able to do any of those things, but it will understand what you're saying." And I think we are at that stage right now: NLP and voice have become really, really good at understanding what humans mean.

But language understanding is still AI-complete. When I say AI-complete, I mean we need a more generalized AI to fully understand language. Why is that? Because when we as humans think about language, we're not thinking just about words and semantics; we're also thinking about tone, context, cultural references, and so on. There's a lot of activity going on in your brain to figure out what a sentence means, and to get there you need a more generalized AI. But that shouldn't deter you from doing what you're doing right now. We believe at Salesforce that if you focus on solving customer pain points in your domain, get really good at that, and iterate with customer feedback, you can get there. You don't have to wait for a more general AI to help you with this. And the voice assistant is our first entry into doing that.

We also think voice will become the next big user experience, the next big user interface. So we need to start thinking about voice as a design challenge too: how are we going to get all of those users to use voice in their everyday work? Adoption will be slow and gradual at first, but it's going to take off quickly once it hits, just like how the iPhone changed the game for touch interfaces.

That is the end of the talk. I hope you found something useful in it. There are a lot of resources out there: you will find blogs, published papers, research, and demos about the voice assistant and the products. Thank you very much.

Questions and Answers

Participant 1: You mentioned combining multiple features or multiple capabilities in one model. How would you do metrics and performance measurement in that kind of scenario?

Vijayakumar: Right now we have different evaluation metrics for each of those, because that's the standard: in natural language processing, each task, like question answering or summarization, has its own set of benchmark metrics. What we do in the Natural Language Decathlon is model everything as a question answering task. For example, if you were to do sentiment analysis on a piece of text, say the text was, "Manju gave a talk, but nobody clapped," a sentiment analysis model would try to work out, "Is she sad or happy?"

So when we see that text, we model it as a question answering task: we ask the question, "Was she sad or happy?" and the model outputs "sad" or "happy" as the sentiment. We are driving towards using question answering as our base formulation, and then trying to make sure that the metrics that work for question answering also work well for the other tasks. But we're not quite there. It's a great question, but we haven't fully solved that yet.
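The reframing described here, every task expressed as a (question, context) pair fed to one model, can be sketched like this. The "model" below is a keyword heuristic standing in for a real multitask network; it is an illustration of the interface, not the decaNLP implementation.

```python
# Toy illustration of casting different NLP tasks as question answering,
# in the spirit of the Natural Language Decathlon: one (question, context)
# interface serves every task.

def qa_model(question, context):
    """One interface for every task: (question, context) -> answer."""
    if question == "Is this review positive or negative?":
        # Sentiment analysis, posed as a question. Crude word heuristic.
        sad_words = {"nobody", "sad", "bad"}
        words = set(context.lower().replace(".", "").split())
        return "negative" if words & sad_words else "positive"
    if question == "What is the summary?":
        # Trivial "summarization": return the first sentence.
        return context.split(".")[0] + "."
    return "unknown"

# Sentiment analysis, asked as a question:
print(qa_model("Is this review positive or negative?",
               "Manju gave a talk, but nobody clapped."))   # -> "negative"

# Summarization, asked as a question:
print(qa_model("What is the summary?",
               "Voice is the next interface. Many details follow."))
```

Because every task goes through the same question-answering interface, adding a new capability means adding a new question form rather than swapping in a whole new model, which is the design motivation behind the single multitask model.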

Participant 2: I noticed in your demo that when you do the voice transcription, it automatically creates an itemized list. Is that something your system does? How does it do it?

Vijayakumar: For our first iteration, we cheated a little bit. If you noticed in the demo, we have a button called Action Item. Users can press that button to create action items, which makes it very easy for the algorithms to pick them up; when it was a completely unstructured document, our accuracy was not as good. So for our first iteration, we go with just an itemized list of actions.



Recorded at:

Jan 22, 2019