
Testing Machine Learning: Insight and Experience from Using Simulators to Test Trained Functionality

Key Takeaways

  • Testing machine learning (ML) applications requires a black-box mentality. It’s hard to understand and explain a decision made by trained functionality, even when you look at the internal structure of the model.
  • The distribution of training and testing data sets defines the functionality; you can partition the data to represent all defined valid testing scenarios combined with functionally defined scenarios.
  • With the Operational Design Domain (ODD), you can define requirements for your ML function. When you find behavior that does not match your expectations, you must figure out whether you are inside or outside your ODD.
  • Simulators support annotation, for instance by identifying and separating objects in an image of training data. They are a tool-driven aid for test scenarios for which we cannot produce “real world” data, and they can speed up test execution by enabling control of the environment (traffic, weather, infrastructure, etc.).
  • Knowledge and experience from testing traditional code remain valuable when working with ML applications; black box test techniques and domain knowledge apply directly when testing them.

When new technology is introduced, we must explore how to test it. Through research in the verification and validation of trained, machine-learned functions, and by applying that research in testing, I gained the insights and experience in testing machine learning applications that I share in this article.

For machine learning applications, the code itself isn’t interesting. Rather than functionality constructed from complex and massive code bases, machine learning applications consist of a few lines of code plus a complex network of weighted data points that forms the implementation. The data used in training is where the functionality is ultimately defined, and that is where you will find your issues and bugs. Data is the key to any trained functionality.

When testing machine learning systems, we must apply existing test processes and methods differently. Testing should be independent and take a fresh approach to any code or functionality. We also need to create independent test sets rather than relying on the validation split used during training.

We must also address the problems of version handling. The CACE principle (Changing Anything Changes Everything) applies in machine learning. ML functionality should be considered opaque, so in a way it is a black box.

It’s at least very time-consuming and hard to understand exactly how an outcome is decided. It traces back to the distribution of the training data, the weights used when training, and the type of network. From a tester’s perspective, it’s better to consider the function a super black box. This also means that if we decide to retrain for some reason, it’s easier to think of the new model as version 2.0 rather than 1.2.

Testing machine learning functionality

In testing machine learning (ML), the code itself isn’t that interesting; it becomes somewhat irrelevant. For an old-school tester, the code and the function are "the way." With ML, the functionality you verify or test is largely defined by the training data. When we shift focus from code to training data, unit testing or "close to the code" testing turns into testing the data used to train the functionality, instead of testing individual code statements or functions.

When moving up through traditional test levels, simulations and other tools will be useful to test or validate the functionality. But any problems we find here (in a simulator) or in production (your launched system or autonomous vehicle) will need to be fixed by altering the training data in some structured way.

When testing trained functionality, knowing your training data will always be important. Examining the distribution and composition of the training data can replace unit testing. Reviews (static testing) of the distribution can be considered early testing, in the same way as code reviews or reviews of requirements. Review your data set early to identify unwanted distribution or bias; this way, you can avoid poor performance in your functionality at later stages.
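
As a sketch of such an early review, assuming a simple labeled data set and an arbitrary minimum-share threshold (both are illustrative choices, not rules from the article), a distribution check might look like this:

```python
from collections import Counter

def review_distribution(labels, min_share=0.05):
    """Flag classes whose share of the training set falls below a threshold.

    `min_share` is an assumed review criterion; a real review would pick
    thresholds per class based on the ODD and the function's requirements.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items() if n / total < min_share}

# 'pear' makes up only 5% of the set and would be flagged in the review.
labels = ["apple"] * 50 + ["banana"] * 45 + ["pear"] * 5
print(review_distribution(labels, min_share=0.10))
```

Catching an under-represented class here, before training, is far cheaper than discovering the resulting poor performance in later test stages.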

When working with and testing trained functionality, another aspect that sets it apart from "traditional" code and testing activities is that each change or bugfix gives you a new function. Unlike traditional testing, where you can isolate a fix and do a retest plus some regression in nearby functional areas, you need to consider it a completely new version of the function and will probably want a full run of your test suite. Of course, clever risk management can speed up the process, and with the help of tools and simulators it can be made more efficient.

Defining or designing ML systems

One of the first things that comes to mind in defining ML systems is requirements. Traditional ways to describe and specify requirements don’t work well for trained functionality. In the projects I have worked on, we have used Operational Design Domain (ODD) to define the context within which the model shall function.

You can think of ODD as a way to define requirements for your ML function. So, for autonomous driving, we divide it into categories, for example:


  • Junctions
  • Rural or city
  • Road/off-road
  • Weather (rain, fog ...)
  • Lighting (day, night ...)
  • Traffic (pedestrians, cars, bikes ...)
  • Your car or your function in the context

When you find behavior that does not match your expectations, you must figure out whether you are inside or outside your ODD. If the behavior occurs inside the ODD, you should consider it a bug or anomaly to investigate further; outside the ODD, the function was never required to handle the situation.
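
A minimal sketch of how an ODD could be encoded and checked; the category names and allowed values are hypothetical, and a real ODD would be far richer than a flat dictionary:

```python
# Hypothetical ODD: allowed values per category, mirroring the list above.
ODD = {
    "road_type": {"rural", "city"},
    "weather": {"clear", "rain", "fog"},
    "lighting": {"day", "night"},
}

def inside_odd(scenario, odd=ODD):
    """True if every category of the recorded scenario lies within the ODD."""
    return all(scenario.get(cat) in allowed for cat, allowed in odd.items())

# Snow is not in this ODD's weather set, so the scenario is outside the ODD.
scenario = {"road_type": "city", "weather": "snow", "lighting": "day"}
print(inside_odd(scenario))
```

A check like this lets you triage an unexpected behavior mechanically: inside the ODD it is a candidate bug, outside it is a scoping question.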

The distribution of the training data decides much of the trained functionality's performance. With this in mind, a "bug fix" is a change in the training data distribution rather than changing lines of code.

Data is key

The distribution of training and testing data sets is crucial. This is where the functionality is somewhat defined. So, how can we test and ensure we have all the important data elements to train an ML model with the correct performance?

We need to look at distribution, of course. The hard part will be context (such as cultural or national differences) and bias. This is where QA can play a part as independent actors raising concerns about the training data or other datasets. An outside perspective is a good thing. Partitioning the data to ensure that all valid scenarios are represented can be a good start. If we are to train a classifier, all valid classes (and probably several invalid ones) need to be represented. I like to think of it as ensuring that all equivalence partitions, valid and invalid, are represented.

For a fruit classifier, we must cover apples, pears, bananas, and so forth. We also need to consider different degrees of ripeness and shapes. With a pre-trained model, the importance of an independent test set is even greater. This is an opportunity to apply an unbiased layer to validate the model. Keeping any defined ODD in the loop when doing this testing is essential. The ODD will be the boundaries we are testing and help us argue for functional correctness or unwanted behavior.
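
A sketch of such a partition-coverage check for the fruit classifier; the class names, including the invalid partitions, are illustrative:

```python
def missing_partitions(dataset_labels, valid_classes, invalid_classes):
    """Return the equivalence partitions (valid and invalid) that the
    data set does not cover. All class names here are illustrative."""
    present = set(dataset_labels)
    return (valid_classes | invalid_classes) - present

valid = {"apple", "pear", "banana"}
invalid = {"rotten", "unknown_object"}
labels = ["apple", "banana", "rotten"]
# 'pear' and 'unknown_object' are uncovered partitions.
print(missing_partitions(labels, valid, invalid))
```

The same idea extends to sub-partitions such as ripeness or shape: each becomes another axis whose values must all appear in the data.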

Another challenge is annotation; it can become a big issue if done subjectively. How exact you need to be in your training and validation data depends on what your model will be used for, but it’s important to be accurate about, for instance, where the road in an image ends and where the pavement starts.

Let’s assume you would like to use a computer vision model to scan or categorize documents. It will be important to distinguish between, say, Å and Ä in the Swedish alphabet. When training the model, we must annotate whether there is one dot or two above the A.

Let’s consider a traffic scenario for functionality connected to an autonomous drive. The image needs to be annotated so that we can distinguish between pedestrians, roads, other cars, and so on.

Annotated image where all objects are segmented by color or contrast.

From within the simulator

Source: Data synthetization for verification and validation of Machine Learning based systems

Simulators help with annotation, both in creating training data and in testing. They are a tool-driven aid to generate scenarios for which we cannot produce "real world" data, and they can auto-annotate. For instance, when testing autonomous driving I might need to create rain, fog, snow, and so forth. Simulators can aid us in this.

Simulators also give us, if needed, a ground truth in the annotations to help speed up testing. The annotated ground truth provides us with an oracle as to what in the picture is the sky, what is grass, exactly which part of the picture is a pedestrian, and so on.

Most simulators for testing computer vision or autonomous driving have filters or modes that auto-annotate your scenarios, providing a ground truth or oracle for the different components. When using sensors besides a camera, such as radar or LiDAR, the simulator can give you point clouds or semantic information to use as a basis for testing.
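
As a sketch, assuming the simulator exposes per-class ground-truth pixels (the pixel sets below are invented), a model's segmentation can be scored against that oracle with intersection-over-union:

```python
def iou(pred_pixels, truth_pixels):
    """Intersection-over-union between predicted and ground-truth pixel sets
    for one class (e.g. 'pedestrian'), the latter from the simulator oracle."""
    pred, truth = set(pred_pixels), set(truth_pixels)
    if not pred and not truth:
        return 1.0  # both agree the class is absent
    return len(pred & truth) / len(pred | truth)

# Pixels as (x, y) tuples; values are purely illustrative.
truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 0), (0, 1), (1, 0), (2, 2)}
print(iou(pred, truth))  # 3 shared pixels out of 5 total = 0.6
```

With the simulator supplying `truth` automatically, a test run needs no manual annotation at all; a threshold on the score becomes the pass/fail criterion.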

Simulators can also help when looking for corner cases more efficiently. For instance, we vectorized our scenarios and then automated the search for scenarios where the model failed. With some simple automation, we can take a base scenario and, for each run, slightly change the amount of rain or daylight, gradually looking for combinations of snow, traffic, and so on that cause the model to make a wrong prediction. In a simulator this is easily automated; out on the street it would be much harder.
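
The search described above can be sketched as a simple parameter sweep. Here `run_scenario` stands in for a real simulator run, and the failing condition in `fake_run` is invented purely for illustration:

```python
import itertools

def find_failures(run_scenario, rain_levels, light_levels):
    """Sweep environmental parameters and collect combinations where the
    model under test fails. `run_scenario` returns True when the model
    behaved correctly in that simulator run."""
    failures = []
    for rain, light in itertools.product(rain_levels, light_levels):
        if not run_scenario(rain=rain, light=light):
            failures.append({"rain": rain, "light": light})
    return failures

# Toy stand-in for the simulator: the model fails in heavy rain at night.
def fake_run(rain, light):
    return not (rain >= 0.8 and light == "night")

print(find_failures(fake_run, [0.0, 0.5, 0.9], ["day", "night"]))
```

A real search would add more axes (snow, traffic density, time of day) and could replace the exhaustive product with a guided search once the space gets large.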

Research projects in testing machine learning

The insights and experience in this article come from research projects that looked into how to test machine learning functionality. The EU Commission and the Swedish government funded these projects.

The team I worked with took part in three major studies. All three were connected to the verification and validation of trained functionality.

  • FramTest focused on identifying challenges in the industry today.
  • SMILE focused on processes and methods to define and defend a safety case.
  • Valu3s focused on the use of simulators for testing trained functionality.

#1 FramTest - "Test methodologies of the future: Needs and requirements"

The FramTest project (in Swedish) examined how companies attack the ML problem today. We did a literature study exploring the state of the art in the field. The finding was that despite many papers on developing machine learning functionality, very few consider verification and validation. To follow up on the academic papers, we did 12 in-depth interviews with testers and leaders in Sweden on the topic.

The results from the interviews can be broken down into:

  • Lack of test data
  • No annotation
  • Culture issues

#2 SMILE - "Safety analysis and verification/validation of system based on machine learning"

In the SMILE study, we looked at scenarios, architecture, and, most importantly, data collection. We looked at requirements and ODD (Operational Design Domain) handling, comparing standards like ISO 26262 with the then-new standard ISO 21448 (SOTIF). We also evaluated the EU list for trustworthy AI and how it compares to the standards mentioned above.

We learned that how you gather data and how you set up requirements matter a great deal. As discussed earlier, data collection needs to be relevant for the ODD.

In this project, we explored how the distribution and composition of the training data need to evolve when a corner case or unwanted behavior is found. Retraining only the model’s outer layers might be enough, depending on your model and function. But in most cases, retraining the model is needed to ensure that the unwanted behavior is handled, and this triggers more test activities.

#3 Valu3s - "Verification and Validation of Automated Systems’ Safety and Security"

We conducted a 3.5-year EU-funded project called Valu3s, using simulators to speed up the maturing of ML functionality. My takeaway is that a simulated test environment is crucial if you want to do any sort of automation, corner case searching, or scenario-based testing.

Examples from simulators used in Valu3s project

Source: Efficient and Effective Generation of Test Cases for Pedestrian Detection

The pictures here are examples from a scenario we used when automating tests. The image on the left shows the route a pedestrian takes to cross the street; the picture on the right shows a car with an autonomous driving model connected to it. By vectorizing the scenario into a structured file, we could modify external conditions to find combinations that made the car hit the pedestrian. With automation iterating over parameters such as lighting, rain, and time of day, we can find unwanted behavior.
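
A minimal sketch of such a vectorized scenario and its parameter variations; the file layout and field names are invented, not the project's actual format:

```python
import json

# Hypothetical vectorized scenario: the pedestrian's route and the
# environmental parameters live together in one structured record.
base_scenario = {
    "pedestrian_route": [[0, 0], [0, 5], [3, 5]],
    "rain": 0.0,
    "time_of_day": 12,
}

def variant(base, **overrides):
    """Return a mutated copy of the base scenario for one simulator run."""
    scenario = dict(base)
    scenario.update(overrides)
    return scenario

# Nine runs sweeping rain and time of day around the same base scenario.
runs = [variant(base_scenario, rain=r, time_of_day=t)
        for r in (0.0, 0.5, 1.0) for t in (6, 12, 22)]
print(json.dumps(runs[0]))
```

Because each run is just data, the automation can serialize it, hand it to the simulator, and record which variants ended with the car hitting the pedestrian.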

One of the use cases in Valu3s was connected to traffic surveillance; within it, we looked into number plate identification. To test this, we developed an ML-based tool to generate number plates and insert them on vehicles in a simulator.

Example of numberplate testing in Valu3s

Source: Verification and Validation of machine learning based License plate detection algorithms

The tool takes ordinary strings from a file and converts them into images inserted into a defined scenario on a defined car. The simulator allows us to control and change environmental parameters of interest, and the number plate tool lets us try any number plate combination of interest. It also helps when testing number plates from other countries.
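
A hedged sketch of the string-generation side of such a tool, assuming a simplified Swedish plate format ("ABC 123"); rendering the string to an image and inserting it on a car in the simulator is left out:

```python
import random
import re
import string

def random_plate(rng):
    """Generate a plate string in a simplified Swedish format 'ABC 123'.
    The format is an assumption; the real tool converted such strings
    into images placed on vehicles in the simulator."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(3))
    digits = "".join(rng.choice(string.digits) for _ in range(3))
    return f"{letters} {digits}"

rng = random.Random(0)  # seeded for reproducible test runs
plates = [random_plate(rng) for _ in range(3)]
assert all(re.fullmatch(r"[A-Z]{3} \d{3}", p) for p in plates)
print(plates)
```

Swapping in another country's format is then a matter of changing the generator and its validating pattern, which is exactly what makes cross-country plate testing cheap.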

To summarize

We, as testers, have a skill, a view of what’s good and bad, that is still very valid. Our methods might shift, but an independent eye will be valuable down the line. If data, and not code, defines the functionality, our focus as software testers has to shift left in a new way, and our mindsets might need to change a bit.

Fixing a bug or unwanted behavior results in a new version of your function: not version 1.2, but a new function. I realized that to fix a bug, you need to change the dataset you train your model with rather than edit lines of code.

We can automate the testing of ML functions by slightly adapting our traditional software testing methods. This makes the step to test ML functionality smaller.
