The recent Ai4 2023 conference featured a talk by Hussein Mehanna of Cruise titled "How Autonomous Vehicles Will Inform and Improve AI Model Testing." Key takeaways include handling the "long tail" of rare cases, measuring the quality of model output, and deliberately pushing systems to fail.
Mehanna, senior VP and head of AI/ML at Cruise, began with the problem statement that while generative AI has great potential, its output is often unreliable; for example, language models are subject to hallucination. However, he believes that his company's experience deploying autonomous vehicles offers several lessons that can help improve the reliability of generative AI. He recounted some metrics about Cruise's business, noting that their cars have driven autonomously more than 3 million miles, then played a short video showing autonomous vehicles encountering and safely navigating several unexpected situations, such as pedestrians or cyclists darting in front of the vehicles.
Mehanna then shared several use cases of generative AI at Cruise. The company generates synthetic images to use as training data for their autonomous driving models; they generate both single-frame scenes for their perception models and "adversarial" scenarios that test the autonomous behavior of the vehicle. For the latter, Mehanna gave the example of a pedestrian darting in front of the vehicle: with generative AI, Cruise can create a variation of that scenario where the pedestrian trips and falls. Mehanna also said that the vehicle software uses an equivalent of a language model to predict the motion of other vehicles.
He then delved into lessons for making generative AI more reliable. The first lesson is to handle the long tail; that is, to have an explicit strategy for when the model encounters very rare cases. He said that there is a misconception that human intelligence is more robust than AI because it has a better understanding of the world. Instead, he suggested that when humans encounter an unexpected situation while driving, they change their behavior and become more cautious. The key, then, is for AI models to recognize when they face epistemic uncertainty; that is, to "know when they don't know."
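Mehanna did not describe how Cruise detects this, but one common way to approximate epistemic uncertainty is to measure disagreement across an ensemble of models. The sketch below, with a hypothetical `epistemic_uncertainty` helper and an assumed threshold, illustrates the "know when they don't know" idea: when ensemble members disagree, the system falls back to more cautious behavior.

```python
import numpy as np

def epistemic_uncertainty(ensemble_probs: np.ndarray) -> float:
    """Disagreement among ensemble members on the consensus top class.

    ensemble_probs has shape (n_models, n_classes); each row is one
    model's predicted class distribution. The variance of the members'
    probabilities for the consensus class is a simple proxy for
    epistemic (model) uncertainty.
    """
    mean_probs = ensemble_probs.mean(axis=0)           # consensus distribution
    top_class = int(mean_probs.argmax())               # most likely class overall
    return float(ensemble_probs[:, top_class].var())   # members' disagreement

# Hypothetical usage: three models classifying "pedestrian ahead" vs. "clear".
preds = np.array([
    [0.9, 0.1],   # model A is confident there is a pedestrian
    [0.6, 0.4],   # model B is unsure
    [0.2, 0.8],   # model C predicts the path is clear
])

UNCERTAINTY_THRESHOLD = 0.05  # assumed value; would be tuned per application
if epistemic_uncertainty(preds) > UNCERTAINTY_THRESHOLD:
    print("High epistemic uncertainty: fall back to cautious behavior")
```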
The second lesson is to measure the quality of the model's output. Mehanna admitted this is "much easier said than done," but recommended against using simple aggregates. He gave the example of Cruise measuring their vehicles' performance around pedestrians. On average, the performance is excellent; however, when measured by cohort, in this case by pedestrian age group, performance in the presence of children is not good. He noted that it may take multiple iterations to find good quality measures. He also suggested that in many cases, so-called bias in AI models is actually a measurement problem caused by looking at aggregates instead of cohorts.
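To illustrate the difference between aggregates and cohorts (with made-up numbers, not Cruise data), the snippet below shows how an excellent overall average can hide a weak cohort:

```python
import pandas as pd

# Illustrative (not real) evaluation log: one row per pedestrian encounter,
# with a quality score and the pedestrian's age group.
encounters = pd.DataFrame({
    "age_group": ["adult"] * 90 + ["child"] * 10,
    "quality":   [0.98] * 90 + [0.70] * 10,
})

# The simple aggregate looks excellent...
print("overall mean quality:", encounters["quality"].mean())   # 0.952

# ...but slicing by cohort reveals the weak spot.
print(encounters.groupby("age_group")["quality"].mean())
# age_group
# adult    0.98
# child    0.70
```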
Finally, Mehanna encouraged developers to "push your system to the limits" and observe the quality metrics across many very different use cases. He mentioned that Cruise generates synthetic data for adversarial testing of their models. In these scenarios, they have control over all the objects and their behavior; they can, for example, add a pedestrian or make a human-driven car stop suddenly.
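Cruise's simulation tooling is not public, but a sketch with a hypothetical `Scenario` schema suggests what sweeping controlled parameters for adversarial testing might look like: each variation adjusts the objects and behaviors mentioned in the talk, such as a pedestrian darting out or a lead car braking hard.

```python
from dataclasses import dataclass, replace
from itertools import product

# Hypothetical scenario schema; each field is a controlled parameter
# the test harness can vary when generating adversarial variations.
@dataclass(frozen=True)
class Scenario:
    pedestrian_speed_mps: float   # how fast the pedestrian darts out
    pedestrian_trips: bool        # variation where the pedestrian trips and falls
    lead_car_brakes_hard: bool    # human-driven car ahead stops suddenly

def generate_adversarial_suite() -> list[Scenario]:
    """Sweep the controlled parameters to produce adversarial variations."""
    base = Scenario(pedestrian_speed_mps=1.5,
                    pedestrian_trips=False,
                    lead_car_brakes_hard=False)
    speeds = [1.5, 3.0, 4.5]
    return [
        replace(base,
                pedestrian_speed_mps=speed,
                pedestrian_trips=trips,
                lead_car_brakes_hard=brakes)
        for speed, trips, brakes in product(speeds, [False, True], [False, True])
    ]

for scenario in generate_adversarial_suite():
    # A real harness would run each scenario in simulation and record the
    # quality metric, e.g. quality = run_simulation(scenario)
    print(scenario)
```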
Mehanna concluded by saying, "You need to build a system of trust so that you can deploy generative AI---or any AI system---safely. And I believe we can learn a lot from autonomous vehicles."