Building Intelligent Conversational Interfaces


Key Takeaways

  • Create a mind map of interaction between the user and a bot.
  • Build the workflow and make it conversational and personalized.
  • Natural Language Understanding (NLU) enables users to converse naturally.
  • Users might take many turns to convey a message or perform a task.
  • Build and design a bot once, deploy it to many platforms.

The use of smart speakers and conversational devices has been on the rise in the last couple of years. More than 66 million adults in the United States now own a smart speaker, which means nearly a quarter of the country is conversing with devices. Despite this wide penetration, we’re just getting started in realizing the full potential of these devices.

We, at Passage AI, have built a platform that enables enterprises to build intelligent conversational applications and skills.

This blog post is a sneak peek into what happens behind conversational AI technology.

Developing an enterprise-grade skill for a conversational device involves three components:

  1. Interaction Flow. This involves defining and building the interaction users will have with the bot to achieve a goal, troubleshoot a problem, or get a question answered.
  2. Natural Language Understanding (NLU). This makes the bot understand and respond in natural language, and includes intent classification, slot filling, semantic search, question answering, sentiment understanding, and response generation.
  3. Deployment. Once we’ve defined the interface, built it, and added NLU, we have to deploy it to various channels. These include voice interfaces like Google Home, Microsoft Cortana, and Amazon Echo; messaging channels like Facebook Messenger, Android Business Messaging, and Slack; and even a pop-up chat client that can be integrated into a website.

Interaction Flow

An interaction flow is a mind map of the interaction between the user and a bot (conversational interface). We have found it helpful to design the interaction flow before actually building it out. This forces us to think beyond the happy path: the flow we want the user to take. Below is an example interaction flow where we design not just the happy path but also more complicated flows.

Figure 1: Happy path interaction flow on the left and a more complicated flow on the right.

Here are the top three lessons we have learnt while designing conversational flows.

Make it conversational and personable. An intelligent conversational bot should not sound robotic. Instead of repeating the same message, the bot can be configured to pick randomly from a list of responses. Another way is to give your conversational interface some personality. For example, instead of opening with a plain "Hi. How can I help you?", make it more personable: "Hello Tom. Hope you’re having a great weekend. What can I do for you today?"

Figure 2: Make it conversational and personable.
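The response-variation idea above can be sketched in a few lines. The greeting templates and the `get_greeting` helper are hypothetical illustrations, not part of any particular platform’s API:

```python
import random

# Hypothetical pool of greeting templates; {name} is filled in per user.
GREETINGS = [
    "Hi {name}! How can I help you today?",
    "Hello {name}. Hope you're having a great weekend. What can I do for you?",
    "Hey {name}, good to see you again. What brings you here?",
]

def get_greeting(name):
    """Pick a random template so the bot doesn't repeat itself verbatim."""
    return random.choice(GREETINGS).format(name=name)

print(get_greeting("Tom"))
```

The same trick applies to any canned response (confirmations, error messages, sign-offs), not just greetings.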

Know the context. The response that the bot needs to generate depends not only on the previous user message but also on the context. The context includes many things like the past conversations between the bot and the user, the modality of the platform (voice-based vs text-based), the knowledge or experience of the user with respect to the product (first-time [naive] user or a repeat [super] user) and the stage of the entire workflow that the user is in.

Figure 3: Bot response generated using the context.

Gracefully handle error cases. No conversational interface will have complete knowledge of the world and hence it is bound to make a few mistakes. Here are some tips to minimise or eliminate them:

  1. Seek confirmation on what the user has said so far.
  2. Ask for clarification if the bot needs more information.
  3. Gracefully let the user know that the bot did not understand the message.

Conversational Workflow Building Blocks

Once we have defined the interaction, the next step is to build the conversational workflow using abstractions like intents, variables, webhooks to call an API, decision trees for troubleshooting, and a knowledge base for question answering.

Intent is the basic building block of any conversational interface; it captures the meaning behind user messages. Intents comprise keywords, variables, and webhooks to perform an action. Keywords represent the various phrases a user might say to invoke the intent. The keywords of an intent, along with labeled real user messages, form the training data for intent classification (see the NLU section for more details). In the case of customer service, we can define intents such as Track Order and Talk to a Customer Service Agent.

Variables define the input that we need to collect from the user to perform an intent. For example, in order to track a shopping order, we need to get the order ID from the user. Once we have identified the intent and collected the required variables, we need to perform an action. In this case, we can use a webhook to make an API call that fetches the order status.
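Putting intents, variables, and webhooks together, an intent definition might look like the following sketch. The schema, the `track_order` intent, and the `get_order_status` webhook are hypothetical illustrations, not the Passage AI API:

```python
# Hypothetical intent definition: keywords train the classifier,
# variables are slots to collect, and the webhook performs the action.
track_order_intent = {
    "name": "track_order",
    "keywords": ["track my order", "where is my package", "order status"],
    "variables": [{"name": "order_id", "prompt": "What is your order ID?"}],
    "webhook": "get_order_status",
}

def get_order_status(order_id):
    """Stand-in for a real API call to an order-management system."""
    fake_db = {"A123": "shipped", "B456": "processing"}
    return fake_db.get(order_id, "not found")

def fulfill(intent, variables):
    """Once all variables are collected, invoke the intent's webhook."""
    if intent["webhook"] == "get_order_status":
        status = get_order_status(variables["order_id"])
        return f"Your order {variables['order_id']} is {status}."

print(fulfill(track_order_intent, {"order_id": "A123"}))
# → Your order A123 is shipped.
```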

Decision Trees. Many use cases involve troubleshooting, which requires making decisions based on the information users provide. Decision trees are a great way to solve this kind of problem, and having a way to define a decision tree and its control flow is an important building block of a workflow.
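A troubleshooting decision tree can be modeled minimally as nested nodes, where each node either asks a question or gives a resolution. The Wi-Fi scenario below is a made-up example:

```python
# Hypothetical troubleshooting tree: internal nodes ask yes/no questions,
# leaves hold the resolution message.
WIFI_TREE = {
    "question": "Is the router's power light on?",
    "yes": {
        "question": "Can other devices connect to the network?",
        "yes": {"answer": "Restart Wi-Fi on your device and try again."},
        "no": {"answer": "Reboot the router and wait two minutes."},
    },
    "no": {"answer": "Check that the router is plugged in and switched on."},
}

def walk(tree, answers):
    """Follow the user's yes/no answers down the tree until a leaf
    (resolution) is reached, or return the next question to ask."""
    node = tree
    for a in answers:
        if "answer" in node:
            break
        node = node[a]
    return node.get("answer", node.get("question"))

print(walk(WIFI_TREE, ["yes", "no"]))
# → Reboot the router and wait two minutes.
```

In a real bot each user turn would advance the walk by one answer, with the bot emitting the node’s question as its response.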

Knowledge Base. Many conversational agents, especially in customer service, answer users’ questions. A knowledge base is a way to input frequently asked questions (FAQs) and find the right answer to a user’s question.

Natural Language Processing (NLP)

In this section, we are going to discuss various conversational AI building blocks like intent classification, slot filling, dialog state tracking, semantic search and machine reading comprehension. Before we dive deeper into these building blocks, let’s get a basic understanding of deep learning and embeddings.

Deep Learning. In traditional machine learning, a lot of time is spent hand-crafting features, such as whether a word is a stop-word or whether it refers to a country or location. A machine-learned model or function combines these features to predict a response. A couple of examples of traditional machine learning techniques are logistic regression and gradient-boosted decision trees. In deep learning, on the other hand, there is no need for hand-crafted features. The raw input is represented as a vector, and the model learns the interactions between the elements of this vector to tune its weights and predict the response.

Deep learning has been successfully applied for various natural language processing (NLP) tasks such as text classification, information extraction, language translation, etc. Two prominent techniques in deep learning applications for NLP are Embedding and Recurrent Neural Networks such as Long-Short Term Memory Network (LSTM).

Embedding. The first step in any deep learning application for NLP is embedding: converting words or sentences to vectors. There are four kinds of embeddings:

  1. non-contextual word embedding,
  2. contextual word embedding,
  3. sentence embedding and
  4. subword embedding.

The popular techniques in non-contextual word embeddings are GloVe and Word2Vec. Word2Vec was the first popular word embedding introduced by Google in 2013. The core concept behind this is to map similar words to similar vectors in a high dimensional space such that the similarity between the vectors (measured by dot product or cosine similarity) is high.
Figure 4: While developing Word2Vec, researchers at Google observed a nice side effect: the vectors support analogies. The vector for "King" minus the vector for "Man" plus the vector for "Woman" is very close to the vector for "Queen".

At a very high level, to train a Word2Vec model, we take a large corpus (like Wikipedia) and convert it to a list of word pairs. For example, "I want to track my order" can be converted to [(I, want), (want, to), (want, I), (to, track), (to, want), (track, my), (track, to) and so on] using a window size of 1. We then feed one word to the neural network by converting it to a one-hot vector. The input is then projected onto a lower dimension space (which form our embeddings) and then projected back to the vocabulary size and we take a softmax to predict the target word. Words like "track" and "find" have similar predictions most of the time and hence are projected onto similar embeddings. As a result, the cosine similarity of "track" and "find" would be high.
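The pair-generation step described above is easy to sketch directly:

```python
def skipgram_pairs(tokens, window=1):
    """Generate (center, context) training pairs for Word2Vec."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context words within `window` positions of the center word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("I want to track my order".split()))
```

With a window size of 1 each interior word contributes two pairs (one per neighbor), matching the list in the text; real training uses larger windows over a large corpus.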

Figure 5: Training Architecture for Word2Vec - Figure from Chris McCormick blog post.

While Word2Vec achieved a lot of success, it is non-contextual, i.e., the same word used in different contexts has the same vector representation. More recently, contextualized word embeddings like ELMo and BERT have become popular. Both come from a family of NLP problems known as language modeling, which learns the likelihood of a word occurring given its surrounding words. The word "play" may refer to playing a sport or to a theatrical play, depending on context. The big success of BERT can be attributed to its bidirectional language model: while most embeddings rely on a shallow concatenation of unidirectional language models (one forward, one backward), BERT masks certain words in a sentence and trains the model to predict them using context from both directions. This makes it possible to keep the architecture simple for downstream tasks.

Another recently introduced technique is sentence embedding, where a whole sentence is embedded into a vector. Word embeddings are very powerful, but it is hard to derive a single vector representation of a sentence from the vectors of its words. Simple techniques like averaging, max-pooling, or summing word vectors are an approximation but don’t work well in practice. A popular technique here is skip-thought vectors, which takes the Word2Vec idea and applies it to sentences: similar sentences are embedded to similar vectors in a high dimensional space.
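As a concrete baseline, averaging word vectors and comparing sentences by cosine similarity looks like this. The three-dimensional toy vectors are made up for illustration; real embeddings have hundreds of dimensions:

```python
import numpy as np

# Toy word vectors; real ones come from Word2Vec/GloVe and are much larger.
VECS = {
    "track": np.array([0.9, 0.1, 0.0]),
    "find":  np.array([0.8, 0.2, 0.1]),
    "my":    np.array([0.1, 0.9, 0.2]),
    "order": np.array([0.2, 0.1, 0.9]),
}

def sentence_vector(tokens):
    """Naive sentence embedding: the mean of the word vectors."""
    return np.mean([VECS[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector("track my order".split())
s2 = sentence_vector("find my order".split())
print(cosine(s1, s2))  # close to 1.0, since "track" and "find" are similar
```

This averaging baseline loses word order entirely, which is one reason dedicated sentence-embedding methods like skip-thought vectors do better in practice.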

Long-Short Term Memory Network (LSTM). The second important technique in deep learning applications for NLP is the Recurrent Neural Network (RNN). The Long-Short Term Memory Network (LSTM) is a variant of RNN that has been successfully applied to various supervised NLP tasks such as text classification. LSTMs are capable of learning long-term dependencies between words while avoiding problems like the vanishing gradient. They achieve this through a mechanism known as gating: a way to optionally let information through. LSTMs have three gates: an input gate, an output gate and a forget gate.

The forget gate decides how much information must flow from the previous cell state. The input gate decides how much information must flow from the current input and previous hidden state and the output gate decides how much information must flow to the current hidden state.

Figure 6: LSTM Cell Diagram
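The gating described above can be written out directly. This is a minimal NumPy sketch of a single LSTM step, not an optimized implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters for the input,
    forget, and output gates plus the candidate cell update."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])       # input gate: how much new information flows in
    f = sigmoid(z[H:2*H])     # forget gate: how much of the old state to keep
    o = sigmoid(z[2*H:3*H])   # output gate: how much flows to the hidden state
    g = np.tanh(z[3*H:4*H])   # candidate cell state
    c = f * c_prev + i * g    # forget gate scales the previous cell state
    h = o * np.tanh(c)        # output gate scales the new hidden state
    return h, c

# Tiny example: input size 3, hidden size 2, random weights.
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8)
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, U, b)
print(h, c)
```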

Given a sentence, the goal of text classification is to decide which of the desired classes it belongs to. A standard approach is to compute a fixed-size vector representation of the text and use it to pick the class. While there are many ways to get a fixed-size vector from a sentence, a common one is to feed the word embeddings of the message to a Bi-directional LSTM (Bi-LSTM) and take the final hidden states as the representation of the sentence.
Figure 7: Unrolling an LSTM layer.
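The last step of this pipeline, going from a fixed-size sentence vector to a class, is a linear layer plus softmax. A minimal sketch, with made-up hidden states and a made-up two-intent weight matrix:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(h_fwd, h_bwd, W, b, classes):
    """Concatenate the final forward/backward Bi-LSTM states,
    apply a linear layer, and pick the most likely class."""
    s = np.concatenate([h_fwd, h_bwd])   # fixed-size sentence vector
    probs = softmax(W @ s + b)
    return classes[int(np.argmax(probs))], probs

# Made-up example: hidden size 2 per direction, two intents.
classes = ["track_order", "talk_to_agent"]
W = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.zeros(2)
label, probs = classify(np.array([0.9, 0.1]), np.array([0.8, 0.2]),
                        W, b, classes)
print(label)  # → track_order
```

In a trained model, `W` and `b` are learned jointly with the Bi-LSTM weights; here they are fixed for illustration.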

While intent detection is a text classification problem, slot filling and information extraction belong to a class of problems known as sequence labeling. Here, every word or token in a sentence is assigned a label, and the goal is to predict the right label for each word. As above, we can pass the sentence through a Bi-LSTM and predict a label for each word.

A common problem in the customer service domain is building a conversational interface around a knowledge base such as frequently asked questions (FAQs). We are given a number of question-answer pairs, and the goal is to match a user message with the right answer. One way to approach this is as a traditional Information Retrieval (IR) problem: the user message acts as the query and the FAQs act as the corpus. An inverted index with postings lists is created over the corpus for fast retrieval, and traditional scoring techniques like TF-IDF are used for ranking. While this helps us retrieve the most relevant answer, sometimes that answer is too long to be consumed through a conversational interface.
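A toy version of this TF-IDF retrieval, over a hypothetical three-entry FAQ corpus (a real system would use an inverted index and a proper tokenizer):

```python
import math
from collections import Counter

# Hypothetical FAQ corpus: each question is paired with its answer.
FAQS = [
    ("How do I track my order?", "Use the Track Order page with your order ID."),
    ("How do I return an item?", "Start a return from your order history."),
    ("What payment methods do you accept?", "We accept all major credit cards."),
]

def tokenize(text):
    return text.lower().replace("?", "").split()

DOCS = [tokenize(q) for q, _ in FAQS]
N = len(DOCS)
# Inverse document frequency for every term in the corpus.
IDF = {t: math.log(N / sum(1 for d in DOCS if t in d))
       for d in DOCS for t in d}

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * IDF.get(t, 0.0) for t in tf}

def best_answer(query):
    """Score each FAQ question against the query with TF-IDF cosine."""
    qv = tfidf(tokenize(query))
    def score(doc):
        dv = tfidf(doc)
        dot = sum(qv[t] * dv.get(t, 0.0) for t in qv)
        norm = (math.sqrt(sum(v * v for v in qv.values())) *
                math.sqrt(sum(v * v for v in dv.values()))) or 1.0
        return dot / norm
    best = max(range(N), key=lambda i: score(DOCS[i]))
    return FAQS[best][1]

print(best_answer("where can I track my package"))
```

Note how rare terms like "track" carry high IDF weight, so they dominate the match even when most query words are out of vocabulary.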

In Machine Reading Comprehension, or Question Answering, you are given a piece of text (the context) and a query, and the goal is to identify the span of the text that answers the question. A combination of LSTMs and attention models is used to find the answer within the context. At a high level, both the context and the query are fed through LSTM layers on top of word and character embeddings; pairwise query-to-context and context-to-query attention is computed; and further bidirectional LSTM layers predict the start and end positions of the answer span. This is a very active area of research, and there has been a lot of progress in machine reading comprehension in the last couple of years.

Dialog understanding, or Dialog State Tracking, is an active research area. Often, users don’t give all the information needed to complete a task in a single turn. The bot has to converse with the user and guide them toward completing the task (tracking an order, for example). Maintaining the "state" of the dialogue and extracting information across messages is key to dialog understanding. This lets the user go back and forth, change the value of certain variables, and complete the task seamlessly.
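A minimal sketch of tracking state across turns; the slot names and the regex-based extractor are hypothetical stand-ins for a real NLU component:

```python
import re

def extract_slots(message):
    """Toy slot extractor: a real system would use sequence labeling."""
    slots = {}
    m = re.search(r"\b([A-Z]\d{3})\b", message)   # e.g. order IDs like A123
    if m:
        slots["order_id"] = m.group(1)
    if "cancel" in message.lower():
        slots["action"] = "cancel"
    elif "track" in message.lower():
        slots["action"] = "track"
    return slots

def update_state(state, message):
    """Merge this turn's slots into the dialog state; later turns can
    overwrite earlier values, letting users change their mind."""
    state = dict(state)
    state.update(extract_slots(message))
    return state

# The user conveys the task across several turns, then changes their mind.
state = {}
for turn in ["I want to track something", "The order ID is A123",
             "Actually, cancel it instead"]:
    state = update_state(state, turn)
print(state)  # → {'action': 'cancel', 'order_id': 'A123'}
```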


Deployment

The number of conversational platforms has been on the rise: Google Assistant, Amazon Echo, Facebook Messenger, Apple iMessage, Slack, Twilio, Samsung Bixby and even the traditional IVR. As the number of platforms grows, it becomes a nightmare for a developer to build individual bots for each of them. The challenge in building a middleware that integrates with all these platforms is understanding the differences and similarities between the interfaces, and keeping up with their constant changes.

Two other things to keep in mind under the "deployment" umbrella are bot versioning and bot testing. Just as application platforms like the Apple App Store and Google Play Store maintain different versions of an app, a bot development platform needs to maintain different versions of a bot. Versioning lets you easily roll back changes and keeps a history of what has been deployed. Bot testing is not trivial, since the bot’s response depends not only on the user message but also on the context. Besides end-to-end testing, one might also test sub-components like the NLU individually for easier debugging and faster iteration.


Conclusion

This article gives an overview of the building blocks of an intelligent conversational interface. Conversational AI is an emerging area, and best practices are still evolving. We envision a future where one can converse with all of one’s devices: instruct a car to control various functions, have a virtual agent plan and book the next trip, and, when calling an internet service provider’s customer service line, have a virtual assistant immediately and accurately answer one’s questions.


Resources

  1. Word2Vec Explained
  2. The Illustrated BERT
  3. Understanding LSTMs
  4. Gartner Report on best practices for conversational interface strategy

About the Authors

Kaushik Rangadurai is an early engineer at Passage AI, primarily working on Dialog and Natural Language Understanding. He has over 8 years of experience building AI-driven products at companies like LinkedIn and Google. He received his Masters in Computer Science from the Georgia Institute of Technology in Atlanta, specializing in Machine Learning.

Mitul Tiwari is the CTO and Co-founder of Passage.AI. His expertise lies in building data-driven products using AI, Machine Learning and big data technologies. Previously he was the head of People You May Know and Growth Relevance at LinkedIn, where he led technical innovations in large-scale social recommender systems. Prior to that, he worked at Kosmix (now Walmart Labs) on web-scale document and query categorization. He earned his PhD in Computer Science from the University of Texas at Austin and his undergraduate degree from the Indian Institute of Technology, Bombay. He has also co-authored more than twenty publications in top conferences such as KDD, WWW, RecSys, VLDB, SIGIR, CIKM, and SPAA.
