Key Takeaways
- AI assistants can be classified into five levels of conversational AI maturity, defined by their capabilities
- Level 3 assistants, which can handle natural-sounding, multi-turn interactions, are hard to build
- Rasa is an open-source conversational AI framework that allows developers to build Level 3 AI assistants
- Challenging natural language problems like entity disambiguation can be addressed using applied NLP research and Rasa
- AI assistants should be monitored and iteratively improved using NLU metrics, business metrics, and real user conversations
Conversational AI has experienced renewed focus in recent years. Language models have achieved state-of-the-art results, posted impressive scores on language understanding benchmarks like the General Language Understanding Evaluation (GLUE) and SuperGLUE, and lent themselves to practical applications. Even so, conversational AI is far from solved. However, we’re moving to an AI-first world, where people expect technology to be naturally conversational, thoughtfully contextual, and intelligent -- and so most companies will have to consider adopting an AI assistant sooner or later.
In this article, I’ll first discuss the five levels of AI assistants using a standard model for conversational AI maturity. Second, I’ll summarize my own recent experience building a level 3 AI assistant. Finally, I’ll outline various custom tools I built to continuously iterate upon, improve, and monitor the AI assistant in production.
The Five Levels of AI Assistants
Most AI assistants today can handle simple questions, and they often reply with prebuilt responses based on rule-based conversation processing. For instance, if a user says X, respond with Y; if a user says Z, call a REST API; and so forth. However, for AI assistants to provide value to business functions like customer service, supply chain management, and healthcare workflow processes, we need to move beyond the limitations of rule-based assistants and toward a more standard maturity model for conversational AI. In this article, we’ll talk about how to model and deploy a contextual assistant and discuss real-life examples of contextual assistants in production.
There are five levels of conversational AI maturity, defined by capability. These levels let us measure an AI assistant’s progress: where we are today, and where we’d like to go in order to achieve or align with business outcomes.
Level 1
At Level 1, the bot is a traditional notification assistant. It can send you notifications about events or reminders about things in which you’ve explicitly expressed interest. In other words, the assistant sends out preprogrammed notifications or responds to events that are triggered by users. In this case, a help desk assistant might send you a notification about the status change of your help desk ticket.
Level 2
At Level 2, the assistant can answer FAQs and engage in simple dialogues. The dialogues are pre-built, and the assistant relies heavily on intents, entities, and rules. In this case, the assistant may answer some FAQs but will get perplexed should the user engage in interjections or unexpected utterances.
Most assistants today are at Level 2; they’re built using rule-based dialogues or state machines. In this setup, the developer uses a combination of intents, entities, and if/else conditions to build dialogues. Observe the code snippet below: the assistant has to rely on conditional statements to gather information and respond to the user.
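A minimal sketch of this pattern, with hypothetical intents, slots, and responses:

```python
# A minimal sketch of rule-based dialogue handling. The intents, slots,
# and responses here are hypothetical; real Level 2 assistants accumulate
# hundreds of branches like these.
def handle_message(intent: str, slots: dict) -> str:
    if intent == "reset_password":
        if "username" not in slots:
            return "What is your username?"
        return f"A password reset link was sent to {slots['username']}."
    elif intent == "ticket_status":
        if "ticket_id" not in slots:
            return "Which ticket would you like me to check?"
        return f"Ticket {slots['ticket_id']} is in progress."
    else:
        # Anything off the scripted paths falls through to a generic reply.
        return "Sorry, I didn't understand that."
```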
Observe the resulting conversation below. When the user asks an off-topic question in the middle of a dialogue, the assistant gets confused and cannot respond in a relevant manner. Because the dialogue is built from if/else statements, the assistant cannot recognize this new, unexpected conversation path.
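An illustrative (hypothetical) exchange:

```
User: I need to reset my password.
Bot:  What is your username?
User: Actually, is the VPN down right now?
Bot:  Sorry, I didn't understand that.
```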
Level 3
Level 3 assistants are typically able to engage in flexible back-and-forth dialogue. In addition, assistants at this stage can handle user corrections, interjections, chitchat, and sub-dialogues. This is the type of contextual assistant most organizations are attempting to build today.
Levels 4 & 5
At Level 4, the assistant is able to remember your preferences and offer a personalized experience. At Level 5, assistants would be able to monitor and manage a host of other assistants and effectively run certain aspects of enterprise operations. Level 4 and 5 assistants do not exist today.
Case Study of Building an AI Help Desk Assistant
I’ve spent a few years building AI assistants and leading teams that shipped contextual assistants to production. Building contextual assistants is hard. Building contextual assistants that actually work, and drive measurable results, is harder still.
One of the contextual assistants I built last year for an enterprise company was an employee-facing help desk assistant. The goal was to automate a portion of help desk tickets in order to reduce costs. An AI assistant that answers questions is useful; an AI assistant that also executes tasks on behalf of users, and nudges them toward informed decisions, drives even more value.
The help desk assistant we were building had to answer questions about routine technical issues, assist with issue resolution, and perform task execution and follow-ups on behalf of users, none of which reflected predictable conversation paths. Human language is messy and unpredictable; building state machines, or rule-based processing, that attempt to script out possible conversation paths can be incredibly difficult to scale and maintain.
Therefore, we had to use machine learning-powered dialogue management or risk maintaining a large system with thousands of lines of code. Machine learning-powered dialogue management enables AI assistants to train on real user conversations, learn patterns and context, and predict appropriate and sensible responses to queries.
We started to build and iterate upon a Level 3 assistant using Rasa, an open-source platform that provides ML tools to build and deploy contextual assistants.
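To make this concrete, dialogue training data in Rasa’s current YAML format looks roughly like the story below; the intent and action names are illustrative, not our production schema. Instead of hand-written branches, the dialogue model learns conversation patterns, including interjections, from example conversations like this one.

```yaml
# stories.yml (illustrative): a conversation path the dialogue model
# learns from, including an off-topic interjection mid-task.
stories:
- story: reset password with an interjection
  steps:
  - intent: reset_password
  - action: utter_ask_username
  - intent: ask_vpn_status        # user changes topic mid-task
  - action: utter_vpn_status      # answer the interjection...
  - action: utter_ask_username    # ...then return to the original task
  - intent: provide_username
  - action: action_send_reset_link
```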
At a high level, the technology stack we used to build, model, and deploy the Level 3 help desk assistant was as follows. Services like Rasa Core, Rasa NLU, and Rasa Actions formed the foundation, or infrastructure layer, of the help desk assistant: Rasa Core is a machine learning-based dialogue manager, Rasa NLU is a customizable intent classification and entity extraction service, and Rasa Actions is an integration point for calling external services. We deployed Duckling, an entity extraction service, and BERT, a language model used here for named entity recognition (the task of extracting named entities, such as locations, organizations, or personal names, from text), on Azure Kubernetes Service (AKS).
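As a sketch, wiring a separately deployed Duckling server into the NLU pipeline looks something like this in Rasa’s config.yml (the component names are from current Rasa releases; the URL and dimensions are placeholders):

```yaml
# config.yml (illustrative): an NLU pipeline that combines Rasa's own
# intent/entity model with an external Duckling server.
language: en
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
- name: DIETClassifier            # intent classification + entity extraction
  epochs: 100
- name: DucklingEntityExtractor   # structured entities: dates, durations, numbers
  url: http://duckling:8000
  dimensions: ["time", "duration", "number"]
```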
It’s important to note that this enterprise company was an Azure customer, and therefore the architecture was dictated by Azure services. The assistant was integrated with Azure Active Directory, ServiceNow, and Microsoft Outlook to authenticate users, create incident tickets, pull user profiles, and perform other tasks like meeting scheduling.
Rasa’s tracker store, which maintains conversation history and the current state of a user’s conversation, was backed by Azure Cosmos DB. The assistant was deployed to Slack and Microsoft Teams via a Chrome extension and other front-end channels.
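Rasa reads the tracker store configuration from endpoints.yml. A Cosmos DB-backed setup would typically go through Cosmos DB’s MongoDB-compatible API, roughly as follows (the connection values are placeholders):

```yaml
# endpoints.yml (illustrative): persist conversation state in Azure
# Cosmos DB via its MongoDB-compatible API.
tracker_store:
  type: mongod
  url: mongodb://<account>.mongo.cosmos.azure.com:10255/?ssl=true
  db: rasa
  username: <account>
  password: <primary-key>
```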
As for the help desk assistant’s DevOps process: we custom-built a command-line bootstrapper to initialize and set up the assistant’s code base, used Azure DevOps to version control the assistant’s models and source code, and used Azure Pipelines to build and deploy the assistant to the various Kubernetes environments.
Note that Rasa now ships with an out-of-the-box bootstrapper (the rasa init command) to initialize and set up a contextual assistant.
Challenges with Entity Disambiguation
As the user base grew, so did the help desk assistant’s skills and the content it could potentially handle. It’s important to add that a lot of the captured data was noisy. We started noticing issues with entity disambiguation. For instance, if a friend were to say to you, “I’m on my cell,” you’d know that they most likely meant that they were on their cell phone and that the word “cell” may not have been in reference to a biological cell. Similarly, consider this example that confused the help desk assistant:
“I want to schedule a meeting with my team”
or something slightly different,
“I want to schedule a meeting using teams”
Remember that this company is a Microsoft customer, and Microsoft Teams was one of the communication tools employees used. When a user asked to schedule a meeting with their team, they wanted the assistant to set up a meeting with a custom team or Active Directory group they had created; when a user asked to schedule a meeting using Teams, they wanted the assistant to schedule a Microsoft Teams meeting.
The assistant would confuse the entities in question -- “team” and “teams” -- and provide an irrelevant or incorrect response. The two sentences, and the two words, may look similar but mean quite different things in the context of this organization. There were several similar occurrences where the help desk assistant would get confused and need additional training.
In April 2019, the team integrated the assistant with BERT (Bidirectional Encoder Representations from Transformers). BERT achieves state-of-the-art results on word sense disambiguation and other downstream NLP tasks thanks to its bidirectional, contextual language representations, pre-trained on a large text corpus. It is therefore better equipped to resolve issues with entity disambiguation.
However, BERT presented its own challenges: it was somewhat slow. A temporary solution was to create a wrapper around the BERT service that loaded the model into memory once, thereby speeding up request processing. Integration with BERT solved some of the issues with entity disambiguation.
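As a sketch of that load-once pattern, assuming the Hugging Face transformers library and Flask (the model name is illustrative, not the one we deployed):

```python
# Illustrative sketch: keep a BERT-based NER pipeline resident in memory
# and serve it over HTTP, so the model-loading cost is paid once at startup
# rather than on every request.
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Loaded once at startup; reused by every request.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

@app.route("/entities", methods=["POST"])
def extract_entities():
    text = request.get_json()["text"]
    return jsonify([
        {"entity": r["entity_group"], "value": r["word"], "score": float(r["score"])}
        for r in ner(text)
    ])

if __name__ == "__main__":
    app.run(port=8001)
```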
We shipped the conversational assistant to production and continued to collect user conversation data. This data was used to train the AI models and make continuous improvements to the assistant.
Make Continuous and Iterative Improvements
The AI assistant’s results were promising; it reduced calls and tickets and automated large portions of the help desk team’s processes so they no longer had to focus on repetitive and mundane tasks. Overall, it had compelling ROI that we measured using a set of metrics we published at the beginning of the project, and proactively maintained.
We created a custom testing and analytics tool to automate testing, collect and visualize conversations, and measure important metrics. This tool was critical to implementing a continuous learning cycle, in which real user conversations were collected, annotated, and used as training data for the assistant to learn from. Real user conversations provide valuable insight into user behavior and test the limits of your AI assistant, so it’s critical to augment your data set with them. The tool also flagged mishandled and unhandled user requests and sent them to the design team for review. Among the metrics it measured were fallback, the number of times the assistant defaulted to a generic fallback response; success, the number of times the assistant responded with a correct answer or successfully resolved an issue; user retention rate, how many users came back to talk to the assistant; and sentiment, whether interactions were positive or negative.
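As a sketch of the kind of measurement involved (the log format and action names here are hypothetical):

```python
# Hypothetical sketch: compute fallback and success rates from logged
# conversations. Each turn records the action the assistant took; each
# conversation records whether it ended in a resolved state.
from typing import Dict, List

def fallback_rate(turns: List[Dict]) -> float:
    """Share of turns where the assistant gave a generic fallback response."""
    if not turns:
        return 0.0
    return sum(1 for t in turns if t["action"] == "action_default_fallback") / len(turns)

def success_rate(conversations: List[Dict]) -> float:
    """Share of conversations that ended with the issue resolved."""
    if not conversations:
        return 0.0
    return sum(1 for c in conversations if c.get("resolved")) / len(conversations)

turns = [{"action": "utter_vpn_status"}, {"action": "action_default_fallback"}]
print(f"fallback rate: {fallback_rate(turns):.0%}")  # -> fallback rate: 50%
```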
We benchmarked these metrics; set weekly, monthly, and quarterly goals; and tracked the assistant’s progress and made improvements accordingly.
Issues with Multi-Turn Dialogues
Some of the issues with non-linear conversations, where the user introduces a new topic in the middle of the conversation or modifies a previous statement, remained. These types of multi-turn conversations are particularly challenging, and they also happen to be the way that most users actually talk. In an effort to resolve some of these issues, the team experimented with Rasa’s TED (Transformer Embedding Dialogue) policy. Using a transformer architecture, the TED policy can selectively pick which conversation turns to pay attention to, and which conversation turns to ignore.
Additionally, and perhaps distinctively in comparison to recurrent neural network architectures, transformers use a self-attention mechanism that lets them choose which elements of a conversation to attend to when making a prediction. In other words, transformers are uniquely equipped to handle non-linear conversations, where a user might change topics or engage in chitchat in the middle of a conversation, because they’re less likely to become perplexed when a user does something unexpected.
In addition, the TED policy exposes hyperparameters that can be used to tune the model. It’s been said more than once that hyperparameter tuning is sometimes more art than science, because it relies on experimental results more than pure theory: one has to keep trying different combinations and evaluating each model’s performance to find the best-suited one.
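For reference, the TED policy and a few of its hyperparameters are configured in Rasa’s config.yml; the values below are illustrative starting points, not tuned settings:

```yaml
# config.yml (illustrative): enable the TED policy with a few of its
# tunable hyperparameters.
policies:
- name: TEDPolicy
  max_history: 8                    # how many past turns the model attends over
  epochs: 200
  number_of_transformer_layers: 2
  transformer_size: 128
```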
Continuous monitoring of real user conversations and subsequent fixes, integration with BERT, utilization of the TED policy, and additional tooling to support conversational AI workflows helped to deliver a level 3 contextual assistant.
Next Steps
We’ve seen a shift toward open-domain systems: assistants that are, in theory, unencumbered by a particular domain or topic and capable of talking about anything. This makes sense because we have massive amounts of data, and we have systems that are good at collecting and aggregating data. We also have the technological capabilities to tell a compelling story with this data, not merely chase the next benchmark or accuracy score.
While it is true that the field of natural language processing (NLP) has seen many recent advancements, today’s contextual assistants still have a long way to go, because they don’t truly understand language or its relationship with the world. Statistical mimicry of language is not the same as language understanding.
Therefore, it’s useful to maintain a healthy amount of skepticism while evaluating language models and conversational AI frameworks. It’s equally important to note that building a contextual assistant requires machine learning, real user conversations, sound software engineering principles, best standards and practices around continuous integration and continuous deployment (CI/CD), and tooling that supports these workflows.
All of that is to say that we’re at an exciting time for conversational AI to be the next computational platform of choice for companies and enterprises to improve products, offer personalized and curated customer service, and see real results. Conversational AI hasn’t been solved yet, but what’s promising is the pace of innovation and the level of discourse in this field.
About the Author
Mady Mantha is a Senior Technical Evangelist at Rasa. Mady studied Computer Science, Physics, and International Politics at Georgetown University. She has years of experience building ML-driven products and services for think tanks, enterprises, and startups. Mady is a space enthusiast.