
Bots Are Coming! Approaches for Testing Conversational Interfaces

Key Takeaways

  • Testing conversational interfaces calls for a systemic approach, as they are not a single layer or system but a collage of interacting components and services
  • Tests should be ready to embrace the non-linearity of natural conversation and the nuances of language
  • Lean on your API testing skills when approaching conversational interfaces; they are API heavy!
  • Conversations are flexible, so systems need to get their adaptiveness from somewhere. Such interfaces are mostly supported by AI/ML components, so look out for them in the system
  • Testing conversational interfaces is more about user experience and how accommodating the system is for users than about compliance with a strict set of requirements

Last year’s Christmas season shopping charts were topped by smart speakers, smart assistants and the like. What do they have in common? They are all voice-based computing interfaces, and with them a new wave of applications (call them skills, routines, or actions) is likely to come. All of these need testing with an approach adapted to their specifics and context. As is always the case with technology, some things need to be adapted (test strategy, testing approach, validation criteria), others can be reused (e.g. API testing approaches and tools), and, last but not least, some require learning new things (e.g. testing artificial intelligence models and components).

This article accompanies my talk "Conversational Interfaces - Testing the Bots!", which I presented at the European Testing Conference 2019 in Valencia.

How Conversational Interfaces Have Developed over the Years

Conversational interfaces are not that new, or so I like to believe. Formally defined, they are any form of computer user interface based on natural language, be it written or spoken. People have envisioned such interfaces for decades at least: think of HAL 9000, imagined for a 1968 movie. To me, that voice assistant is one of the earliest examples of a conversational interface.

Movies are all well and good, but we all know they often present things in a wishful manner, and that was the case with interfaces like HAL 9000. To be successful, conversational interfaces need as little interaction friction as possible, which in turn requires computing systems that are both ubiquitous and powerful enough to sustain the “translation” effort.

Over the years, computing systems have become smaller and smaller while increasing their processing power and becoming more and more connected. This allows us to have computing elements that are powerful enough to be useful, yet small enough not to bother us, in our pockets or somewhere in the house or office. I am pretty sure this is not new to any reader, as we have all been part of an IoT conversation or a “connected things” scenario. These are manifestations of the same driving forces that enable conversational interfaces, and they often support and complement each other.

Recent years provided the perfect environment for conversational interfaces to really become a thing, as more and more people carry increasingly powerful mobile phones. Progress in artificial intelligence, which reliable speech recognition and processing depend on, has also played its part.

Over the same period, conversational interfaces first became real (more than a movie prop) and then more and more refined. A more connected world also helped, as in most cases conversational interfaces provide the entry point through which smart “things” and appliances interact with us.

The Main Challenges in Testing Conversational Interfaces

Challenges? There are plenty, and from various angles. 

Programming, and testing alongside it, evolved on the premise of clear and unambiguous logic, where input values are clear and unequivocal. Natural language, by contrast, is far from unequivocal and much more flexible in terms of syntax and semantics. This is the source of one of the most important challenges in testing conversational interfaces: natural language is the input, and we humans love having alternatives, synonyms, and expressions. In this context, testing moves from pure logic towards something closer to fuzzy logic and clouds of probabilities.

Because they are intended to provide a natural interaction, conversational interfaces also demand a great deal of empathy from testers, along with an understanding of human society and ways of interacting. Here I would include cultural aspects, such as the paraverbal aspects of speech (that is, everything communicated besides the words themselves, encoded in voice modulation and volume). These elements add another level of complexity, and the person doing the testing often needs to consider them. I believe it is fair to say that testing a conversational interface can also be seen as tuning it so that it passes a Turing test.

Another challenge is the distributed architecture of these systems. Most of the time, a system with such an interface is a collection of distributed services glued together via API calls. This is not new, but it can become complex rather quickly, and it can pull the testing effort off course, so that one ends up testing systems and interactions other than those initially intended.

Consistency of input is another challenge. To minimize variability, the input needs to be as reproducible as possible, yet voice varies easily from one utterance to the next.

Last but not least on the list of challenges, I believe, is performance, as it can be tricky to pin down. What makes an interface performant? It is a mix of ease of use, flexibility, latency, and scalability, to mention just some of the facets of performance that can influence how users perceive the overall conversational interface and its purpose.
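Latency is the most directly measurable of these facets. The following is a minimal sketch, not a prescribed tool, that times repeated requests and reports percentiles rather than a single average, because a few slow turns can spoil a conversation. `send` is a hypothetical stand-in for whatever call delivers an utterance to the bot. The clock is injectable so that the arithmetic can be checked without a live service.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def latency_profile(
    send: Callable[[str], object],
    utterances: Iterable[str],
    repeats: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time each call to `send` and summarise the samples in milliseconds.

    `send` is a hypothetical callable that delivers one utterance to the
    system under test (e.g. an HTTP request to the bot's API).
    """
    utterances = list(utterances)  # allow iterating once per repeat
    samples_ms = []
    for _ in range(repeats):
        for utterance in utterances:
            start = clock()
            send(utterance)
            samples_ms.append((clock() - start) * 1000.0)
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # method="inclusive" interpolates between observed samples,
    # so every cut point stays within the observed min..max range.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples_ms)}
```

Percentiles are only one lens on performance. Ease of use and flexibility still need exploratory, human judgement.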

How We Can Deal with These Challenges

First, I believe it is important to step out of the comfort zone and accept uncertainty and complexity. This means operating with confidence ranges where we were used to operating with fixed, well-defined values. This is needed because words and phrases can have multiple meanings, and the same intent (a fundamental building block of a conversational interface) can almost always be modelled and expressed in more than one way.
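One way to put this into practice is to assert on a band of acceptable confidence instead of an exact value. The following sketch assumes a hypothetical `NluResult` shape (an intent name plus a confidence score) and a 0.7 lower bound chosen purely for illustration. A real NLU service will have its own response format and a threshold tuned to the product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NluResult:
    """Hypothetical shape of an NLU response: the matched intent and a score."""
    intent: str
    confidence: float


def failing_phrasings(
    results: Dict[str, NluResult],
    expected_intent: str,
    low: float = 0.7,
    high: float = 1.0,
) -> List[str]:
    """Return the phrasings whose interpretation falls outside the accepted band.

    A phrasing passes when it maps to the expected intent with a confidence
    inside [low, high]; the exact score may vary within that band.
    """
    failures = []
    for phrasing, result in results.items():
        in_band = low <= result.confidence <= high
        if result.intent != expected_intent or not in_band:
            failures.append(phrasing)
    return failures


# Hypothetical responses for several ways of expressing the same intent.
observed = {
    "book a table": NluResult("book_table", 0.93),
    "reserve a table for two": NluResult("book_table", 0.81),
    "can I get a table": NluResult("book_table", 0.64),
    "table please": NluResult("order_food", 0.72),
}
print(failing_phrasings(observed, "book_table"))
# → ['can I get a table', 'table please']
```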

Second, what helped in my case was building a test strategy that includes a clear vision and mission statement, along with an ever-expanding map of building blocks and dependencies. This keeps testers from losing track of their goal and from investing a lot of time in testing parts of the system that are either not vital or not even under their control. Having a test strategy defined helped me better understand what I was testing and what type of system I was dealing with. This matters because interfaces based on static rules need a different approach from interfaces based on AI/ML models, such as those using natural language processing services or sentiment analysis as a controlling variable.

Getting technical and pursuing built-in testability of the system under test also helps a lot. For me, this included having many inspection and logging points, and decoupled interfaces that are programmatically accessible. As I mention in my talk, API testing tools and approaches are really useful when it comes to testing conversational interfaces. Tools like Postman and Swagger, along with plenty of scripting, can help ensure repeatable tests and decent visibility when exploring the whole system. Concretely, I have in mind using Postman collections to capture interaction scenarios in a form that is easy to share with the whole team, and then executing these collections with variations in both input and sequence through Python scripts.
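The following is a minimal sketch of the "variations in input and sequence" idea. It assumes a scenario has already been reduced to a list of steps, each consisting of an intent with several alternative phrasings. The scenario, intent names, and phrasings are hypothetical. The code only generates the runs; sending them to the system (for example by feeding them into a Postman collection run) is left out.

```python
import itertools
from typing import Iterator, List, Sequence, Tuple

Step = Tuple[str, List[str]]  # (intent name, alternative phrasings)

# Hypothetical scenario; in practice the steps could come from a Postman
# collection export, but they are inlined to keep the sketch self-contained.
SCENARIO: List[Step] = [
    ("greet", ["hi", "hello"]),
    ("choose_size", ["a large one", "make it large"]),
    ("choose_topping", ["add mushrooms", "with mushrooms"]),
]


def orderings(scenario: Sequence[Step], movable: Sequence[int]) -> Iterator[List[Step]]:
    """Yield the scenario with the steps at `movable` positions permuted; others stay put."""
    positions = sorted(movable)
    for perm in itertools.permutations(positions):
        order = list(range(len(scenario)))
        for slot, source in zip(positions, perm):
            order[slot] = source
        yield [scenario[i] for i in order]


def variations(scenario: Sequence[Step], movable: Sequence[int]) -> Iterator[List[Tuple[str, str]]]:
    """Combine every ordering with every choice of phrasing for each step."""
    for steps in orderings(scenario, movable):
        intents = [intent for intent, _ in steps]
        for phrasing in itertools.product(*(alternatives for _, alternatives in steps)):
            yield list(zip(intents, phrasing))


for run in variations(SCENARIO, movable=[1, 2]):
    # Each run would be replayed against the bot's API; here it is only printed.
    print(run)
```

Keeping the greeting fixed while permuting the later steps reflects an assumption that users open a conversation the same way but may supply details in any order. Two orderings times eight phrasing combinations yield 16 runs.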

Earlier I mentioned fuzziness; one approach I took to this challenge was focusing on transitions rather than on states. The purpose of any interface is to guide the user through a flow towards a goal. From this point of view, the individual states, or stops, in the flow are less important than the transitions between them. I learned quite early that when testing chatbot and voice assistant interfaces, ensuring that transitions happen matters more than covering the states. This came from personal experience: as a user, I found myself blocked in a dead end and had to leave the app, because my input did not match any of the values the app expected, and the app gave no hint about what it did expect. Scenarios like this made me realise that, in conversational interfaces, the user experience benefits from having options for getting from one state to another.
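Focusing on transitions lends itself to a simple static check over a model of the flow. The following sketch assumes a hypothetical flow written as a mapping from each state to its triggers and next states, with `"*"` standing for "any unrecognised input". It flags non-final states that have no way out or no fallback, which is the kind of dead end described above. It also flags triggers that point to states missing from the model.

```python
from typing import Dict, List, Set

FALLBACK = "*"  # hypothetical marker meaning "any input not matched by another trigger"

# Hypothetical flow: state -> {trigger: next state}
FLOW: Dict[str, Dict[str, str]] = {
    "start": {"greet": "menu", FALLBACK: "help"},
    "menu": {"order": "choose_size", "track": "tracking", FALLBACK: "help"},
    "choose_size": {"small": "confirm", "large": "confirm"},
    "help": {"order": "choose_size", FALLBACK: "menu"},
    "tracking": {},
    "confirm": {"yes": "done", "no": "menu", FALLBACK: "help"},
    "done": {},
}
FINAL_STATES: Set[str] = {"done"}


def transition_gaps(flow: Dict[str, Dict[str, str]], final: Set[str]) -> Dict[str, List[str]]:
    """Report states where a user could get stuck, keyed by state name."""
    gaps: Dict[str, List[str]] = {}
    for state, transitions in flow.items():
        if state in final:
            continue  # ending the conversation here is intended
        problems = []
        if not transitions:
            problems.append("no outgoing transitions")
        elif FALLBACK not in transitions:
            problems.append("no fallback for unrecognised input")
        for trigger, target in transitions.items():
            if target not in flow:
                problems.append(f"trigger {trigger!r} leads to unknown state {target!r}")
        if problems:
            gaps[state] = problems
    return gaps


print(transition_gaps(FLOW, FINAL_STATES))
# → {'choose_size': ['no fallback for unrecognised input'], 'tracking': ['no outgoing transitions']}
```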

What We Learned

Testing is fun and continuous learning! That is the first lesson I learned, and it has been reinforced over the years as I have tested more and more diverse types of systems. More specifically for conversational interfaces, I learned that tools should be adapted to the context, that exploration involves many technical aspects, and that general systems thinking is a useful field of study.

One lesson I learned while testing a conversational interface was to always ask myself: “Is the component I am testing now under my team’s control, or am I providing free testing to some other service provider?” This mattered, as I realised I had invested a significant amount of time in testing aspects of the conversational interface that were not under our control.

Another lesson I learned early on was not to be intimidated by AI/ML components. I treat them with respect, knowing that underneath there are models that can be handled as black boxes with a good degree of determinism. This comes from understanding that every AI/ML-supported component has a cutoff value (the threshold for deciding whether a value belongs to a classification domain or not), and that which side of that border an input falls on determines the component's behaviour. From a testing perspective, this means building equivalence partitions on each side of the decision point and testing as many values as possible from each partition. For example, when testing a greeting, I usually try at least ten options on both the positive and the negative side, and for each intent I try to come up with as many synonyms and equivalent phrases as I can. This way I try to cover the variability of the input an AI/ML component is expected to handle.
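The partitioning idea can be sketched as follows, treating the model as a black box that returns a score. `toy_greeting_score` is a deliberately naive, hypothetical stand-in for a real classifier, and the 0.5 cutoff is an assumption. In practice, the function would call the NLU service, and the cutoff would be the one the system actually uses.

```python
from typing import Callable, List, Sequence, Tuple

CUTOFF = 0.5  # hypothetical decision threshold


def toy_greeting_score(utterance: str) -> float:
    """Hypothetical stand-in for a real model: 1.0 if a greeting word appears, else 0.0."""
    greeting_words = {"hi", "hello", "hey", "morning", "afternoon", "evening",
                      "howdy", "greetings", "hiya", "yo"}
    words = utterance.lower().replace(",", " ").replace("!", " ").split()
    return 1.0 if any(word in greeting_words for word in words) else 0.0


# Equivalence partitions on either side of the cutoff: ten inputs each.
POSITIVE = ["hi", "hello", "hey there", "good morning", "good afternoon",
            "good evening", "howdy", "greetings", "hiya", "yo, what's up"]
NEGATIVE = ["what is the weather today", "set a timer for ten minutes",
            "play some jazz", "turn off the lights", "order a pizza",
            "what time is it", "remind me to call mom",
            "how tall is mount everest", "stop", "cancel my order"]


def misclassified(
    score: Callable[[str], float],
    positives: Sequence[str],
    negatives: Sequence[str],
    cutoff: float,
) -> List[Tuple[str, str, float]]:
    """List (partition, utterance, score) for inputs landing on the wrong side of the cutoff."""
    failures = []
    for utterance in positives:
        value = score(utterance)
        if value < cutoff:
            failures.append(("positive", utterance, value))
    for utterance in negatives:
        value = score(utterance)
        if value >= cutoff:
            failures.append(("negative", utterance, value))
    return failures


print(misclassified(toy_greeting_score, POSITIVE, NEGATIVE, CUTOFF))  # → []
```

Adding an input such as "good day" to the positive partition would surface it as a failure, pointing at a phrasing the model (here, the toy) does not yet cover.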

The Skills Needed to Test Conversational Interfaces and How to Develop Them

As a starting point, I would say the API testing skill set is a good option. In my view, this holds because, from a technical perspective, a conversational interface can be likened to an elaborate orchestration of API calls.

Empathy is another important element. Natural language is a much more intimate way of interacting than a GUI, and putting oneself in the shoes of the intended user, as well as of significant people around the user (e.g. parents for child-oriented applications), requires a significant level of empathy. At the same time, a tester should also look beyond the intended path to ensure that the app is friendly both to its users and to the people who run and support it throughout its lifecycle (e.g. support agents, back-end operators). This is also important because, for someone coming from API testing, the users are often systems or developers, while conversational interfaces are mostly aimed at a less technical audience.

Modelling skills are also needed, so that one can create a map of the intents, transitions and associated triggers that will together define the conversational interface flow. 
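Such a map can be kept as data and checked mechanically. The following sketch assumes a hypothetical `Transition` record containing a source state, an intent, a target state, and sample trigger phrases. It reports two gaps that a modelling pass tends to reveal: states that cannot be reached from the start of the conversation, and intents that have no trigger phrases yet.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Transition:
    source: str
    intent: str
    target: str
    triggers: List[str] = field(default_factory=list)  # sample utterances for the intent


def build_map(transitions: List[Transition]) -> Dict[str, List[Transition]]:
    """Index transitions by source state; every mentioned state gets an entry."""
    graph: Dict[str, List[Transition]] = {}
    for t in transitions:
        graph.setdefault(t.source, []).append(t)
        graph.setdefault(t.target, [])
    return graph


def review(transitions: List[Transition], start: str) -> Tuple[Set[str], List[str]]:
    """Return (states unreachable from `start`, intents without trigger phrases)."""
    graph = build_map(transitions)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk over the transitions
        for t in graph.get(queue.popleft(), []):
            if t.target not in seen:
                seen.add(t.target)
                queue.append(t.target)
    unreachable = set(graph) - seen
    untriggered = [t.intent for t in transitions if not t.triggers]
    return unreachable, untriggered


# Hypothetical map of a small ordering flow.
FLOW_MAP = [
    Transition("start", "greet", "menu", ["hi", "hello"]),
    Transition("menu", "order_pizza", "choose_size", ["I want a pizza", "order food"]),
    Transition("menu", "track_order", "tracking"),
    Transition("choose_size", "pick_size", "done", ["large", "a small one"]),
    Transition("feedback", "rate", "done", ["five stars"]),
]
print(review(FLOW_MAP, "start"))  # → ({'feedback'}, ['track_order'])
```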

I really hope that after reading this piece, your interest in conversational interfaces has grown. It’s good to realise that things change, and I believe people in testing are prepared for such applications. If I were to pick the highlights of this article, they would be the need for empathy and flexibility, alongside the good news that API testing tools remain quite useful even in this new context.

About the Author

Lucian Adrian is a software tester currently enjoying working with new technologies at R/GA. His experience includes testing a wide range of software applications, ranging from networking & embedded software to mobile apps and alternative interfaces (e.g. Voice, AR, VR, mixed reality). His latest interest is in exploring the interaction between users, computer interfaces, and data-driven methods. You can connect with him on Twitter (@lucianadrian) or through his blog.
