Data-driven Development in the Automotive Field



Toshika Srivastava offers insight into how teams in the automotive field develop products with data and what challenges they face.


Toshika Srivastava is an AI technical team lead at Audi responsible for AI product development for safety-relevant automated driving functions. Her work focuses on the end-to-end development and deployment of safety-relevant AI modules with systematic processes, methods, and tools.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Srivastava: I'm going to give a talk on data-driven development for the automotive domain. I'm Toshika Srivastava, technical lead for data-driven development for autonomous driving. My prime focus areas are function and feature development, and the process, method, and toolchain development needed to enable these features end-to-end. I also look into the safety concept for machine learning modules, which helps in verifying and validating these modules inside the safety development environment.

I'm employed by Audi. As you may have heard, the Volkswagen Group has recently started a new company called Car.Software Org, which is responsible for building software solutions for the mobility industry, especially for Volkswagen Group brands like Audi, Porsche, and Volkswagen. In my current role, I have also been supporting this organization from the start, so that together we can build a company that focuses on software development.


I'm going to start with an introduction to the autonomous driving world for those who are new to this area. Then I will move on to the data-driven development concept. Then, I will talk about the data analytics platform one can use within an organization.

Autonomous Driving and Driving Assistance System

If you are new to autonomous driving and driving assistance systems, they offer multiple features ranging from comfortable, intelligent driving and active driving up to safety, because safety is always one of the concerns when we are delivering cars to our customers. Autonomous driving is divided into multiple levels. This is how automotive companies are certifying their driving software stacks. The levels run from level 0 to level 5. Level 0 is how we normally drive in everyday life, where we have no assistance and are responsible for everything. In some cars you also see features where you can set a speed limit. Then, based on the sensors, if there is an obstacle in front of you, the car reduces its speed, and if there is no obstacle in front of you, the car accelerates automatically to reach that maximum speed. Those come under level 1, or assisted driving. The more you move towards level 5, the more responsibility moves from the human to the machine. Level 5 is fully autonomous: there is no driver, and the machine is responsible for the end-to-end functionality of the driving software. Among all these levels, my area of work lies between level 3 and level 4, which I normally call highly automated driving and fully automated driving, where our KPIs sit between the human and the machine, based on the reaction time we set for these concepts.

Data-Driven Development

That was the introduction to autonomous driving and the automation levels. I would like to move on to data-driven development. I will start with the basics: what is data-driven development? It is developing features, tools, methods, and services based on the knowledge provided by the huge amount of data you have stored within your organization. From my understanding, that's basically data-driven development. You learn features and patterns from your databases in order to perform a certain functionality. One of the efficient ways we currently know to learn from data is to use AI and machine learning methodologies. As we are developing software based on those methodologies, they also come under the product development cycle. Here, I have plotted an image where you can see what the data-driven development product cycle looks like. It's not a one-step process. You learn continuously throughout the whole development cycle, until you reach a certain maturity or you fulfill all the requirements your function needs to meet. This is one structure. Every company can have its own structure. This is one way we are doing it in our team. We have a test car which continuously records data from the driving environment, then we collect the data and clean it. We enrich the data by adding meta information, additional information stored with the data file. Then we ingest this data into the data center, or you can also call it the data management platform. From there, the whole data-driven development CI/CD cycle, continuous integration and continuous delivery, starts. Then each team can take this data. Depending on the area they are working on, they prepare the data differently. For example, if you're working on perception, you prepare the data one way; if you are working on planning, you prepare it another way. Then you label the data according to your specific use case.
Then you train the model and test it. Once you reach certain KPIs, metrics, or a maturity level, you deploy your machine learning module inside a code-centric environment, on a platform in the cloud, or on whatever hardware you are using for your integration. Then you test it. Different types of testing exist, like hardware-in-the-loop, software-in-the-loop, and so on. Then you release this module into the car, or to the platform hosted in the cloud, in order to collect data again based on the deviation KPIs of your function. Or you can deploy a simple corner-case or edge-case detection algorithm, which can then run in the car to collect data efficiently.
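As a rough illustration of the "reach certain KPIs, then deploy" step in this cycle, the following minimal sketch checks measured metrics against release targets. The KPI names, thresholds, and the `KpiTarget` and `unmet_kpis` helpers are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiTarget:
    """A hypothetical KPI with the minimum value required for release."""
    name: str
    minimum: float


def unmet_kpis(measured: dict[str, float], targets: list[KpiTarget]) -> list[str]:
    """Return the names of KPIs that are missing or below their minimum."""
    return [
        t.name
        for t in targets
        if measured.get(t.name, float("-inf")) < t.minimum
    ]


# Illustrative targets only; real values come from the function's requirements.
targets = [KpiTarget("recall", 0.95), KpiTarget("precision", 0.90)]

# Made-up results from two training iterations.
iteration_1 = {"recall": 0.91, "precision": 0.93}
iteration_2 = {"recall": 0.96, "precision": 0.94}

for label, metrics in [("iteration 1", iteration_1), ("iteration 2", iteration_2)]:
    failed = unmet_kpis(metrics, targets)
    decision = "deploy" if not failed else f"keep iterating (unmet: {failed})"
    print(label, "->", decision)
```

In a real pipeline a gate like this would typically run as one CI/CD stage, with the targets derived from the function's requirements rather than hard-coded.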

Hidden Requirements/Challenges for the Data Development Lifecycle

Next, I would like to move on to the hidden requirements and challenges of the data-driven development lifecycle. As this is a new methodology, it brings a new development lifecycle for the product. You always have to bridge the gap between your classical development world and your data-driven development world. Then there is the safety process, which is particularly important in my domain, where I always have to make sure that the modules or the functionality I'm producing are safe to deploy into customer fleets, customer cars, test cars, and so on. You also have to make sure that there are requirements and that you always develop according to them. Otherwise you will end up in an infinite development space, where your algorithm never reaches completeness. Then you also have challenges like scalability and reliability. The picture I wanted to show here is from a research paper from Google, published in 2014, which showed that the machine learning module is a very small part of the whole system. You need a lot of infrastructure, tooling, configuration management, and resource management around your machine learning module in order to do efficient and scalable machine learning or data-driven development for any product. I always refer to this picture, and it was also the motive for why we moved in the direction of spending effort on infrastructure, tooling, and so on.

AI Strategies for Automated Driving

There are challenges and there are requirements, but inside automated driving we still see the potential for machine learning or data-driven approaches to be a huge success, because it's next to impossible to develop one-to-one rules for autonomous driving. There are two strategies companies are using within their autonomous driving stack. The first one, on the top, is deep driving, which works end-to-end. You put in raw sensor data, and you get the lateral and longitudinal control, which you then provide to your car to do the autonomous driving. On the bottom, you see another approach, which is more modular, where you break down this end-to-end system into smaller modules: from perception to fusion to interpretation to prediction, and then to planning. Then you provide the lateral and longitudinal control to your car, which can then drive autonomously based on this information. One can choose either of these depending on the organization. I'm not recommending one over the other. There could be many other approaches too, but these are the strategies most commonly used within the automated driving world.

Use Case Example - Lane Change

Here, I would like to give an example of one of the use cases we are developing as a concept in our team, which is called lane change. On the top you have a driving environment. There's a lane. There are cars in the field, and this is coming from the camera and other sensors. On the bottom is a prediction from the neural network. I would like to show this video to you, and meanwhile, I will explain what exactly is happening. Here you see this other car, and we are predicting whether the car is going to do a lane change or drive straight. Lane change has three classes: straight, left, right. Now you see the car has moved to the right, and the prediction was also right. You predict for every vehicle which direction it is driving: whether it is driving straight or doing a lane change. This is a simple example I wanted to show you so that you get an idea of which modules we are developing using our data-driven development lifecycle.
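To make the three-class output concrete, here is a minimal sketch of how per-vehicle network scores could be turned into a straight/left/right prediction. It assumes the network emits one raw score per class and that a softmax is applied; the talk does not describe the model's output layer, and the vehicle IDs and scores below are made up.

```python
import math

CLASSES = ("straight", "left", "right")


def softmax(logits):
    """Convert raw network scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_lane_change(logits):
    """Return the most likely class and its probability for one vehicle."""
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], probs[best]


# Made-up scores for two tracked vehicles; a real network would produce these.
per_vehicle_logits = {
    "vehicle_7": [0.2, -1.0, 2.3],
    "vehicle_9": [3.1, 0.4, 0.1],
}

for vehicle_id, logits in per_vehicle_logits.items():
    label, prob = predict_lane_change(logits)
    print(f"{vehicle_id}: {label} ({prob:.2f})")
```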

Areas of Activity for Data-Driven Development

There are other areas for data-driven development too, not only functions and features. You can apply machine learning or neural-network-based approaches to data management, data science, infrastructure, simulation and assessment, anomaly detection, and so on, in order to use the capability of deep neural networks, machine learning, or AI, whatever you call it, to solve the problem efficiently.

Data Analytics Platform

I would like to talk about the data analytics platform, because that's one of the things I wanted to cover in my talk. I don't want to cover everything in this time; I would rather focus my energy in one direction. There, I see potential in the data science environment: how we have set up the data analytics tools and which algorithms we are using to get the information. Out of all the activities we do in data-driven development, I would like to go towards this part.

Sensors and Data Collection in the Vehicle

Before starting with the analytics platform, I want to give a bit of information about why we need a smart analytics platform to analyze the data, plot the data, and get statistics out of the data. In the example here, there's a car with, for example, 25 environmental sensors: camera, radar, ultrasonic, LIDAR, maps, and so on. All these sensors collect data at different timestamps or different frequencies, and they deliver different information in different formats. As an example, a development vehicle might record 8 gigabytes per second; this is just an example number. From here you can imagine that the more sensors you add to your autonomous driving setup, the more data you collect within 1 second, and the more data you need to store and analyze. These fleets are recording data 24/7. The task of the data analytics platform is to provide an efficient interface to access the data, to search it, to visualize it, and to add more information to the data recorded by these cars.
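To get a feel for the scale, the example rate of 8 gigabytes per second can be extrapolated to an hour and a day of continuous recording. The sketch below assumes decimal units (1 TB = 1000 GB) and a hypothetical fleet size of 10 vehicles; neither assumption comes from the talk.

```python
GB_PER_SECOND = 8        # example figure from the talk, not a measurement
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24

gb_per_hour = GB_PER_SECOND * SECONDS_PER_HOUR   # data from one vehicle per hour
gb_per_day = gb_per_hour * HOURS_PER_DAY         # one vehicle recording 24/7


def fleet_tb_per_day(vehicles: int) -> float:
    """Daily volume in terabytes for a fleet recording around the clock."""
    return vehicles * gb_per_day / 1000


print(f"one vehicle: {gb_per_hour / 1000} TB per hour, {gb_per_day / 1000} TB per day")
print(f"fleet of 10 (hypothetical): {fleet_tb_per_day(10)} TB per day")
```

Under these assumptions a single vehicle produces 28.8 TB per hour and 691.2 TB per day, which is why sampling and smart search matter before anything is visualized.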

Data Structure and Metadata Format

Here is how the data structure and the metadata format look. You see there are three measurements, for example, measurement A, B, and C. You can think of those as driving campaigns. For each driving campaign you have recorded different scenarios. File 1 recorded one scenario, file 2 recorded a different scenario, file 3 another, and so on. This is just an example format. On the rows, you see different names: Flexray, Video, ECU-X, ECU-Y, ECU-Z, and so on. These are just the types of signals you are getting within each recording. As for the metadata, at a high level, when you record a measurement, you get basic information with the file. You have when the data was recorded, on which date and at which time. What was the vehicle ID? What was the description of the core data? Which streams and signals are in the data files? Which software versions or sensor versions were used when you recorded this data? It also comes with the different types of labels you can store with the data, like events, sequences, and so on. If you have this kind of data, think about 8 gigabytes of recording per second and how much data you have. You don't want to visualize all the data, because sometimes within a second nothing changes, and it is just repetitive data. What you do is take samples from each file, which reduces the number of timestamps, so you don't have repetitive data. You can check whether there are errors or not. You can visualize this on a small set of data coming from each file. This also reduces the content for the platform and for the dashboard you are using in your analytics platform. This is one way to make the visualization of your data format or data structure within your analytics platform efficient. There could be many other ways.
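The metadata fields and the sampling idea described above could be represented roughly as follows. This is a minimal sketch: the `MeasurementMetadata` class, its field names, the example values, and the fixed every-n-th sampling strategy are all assumptions for illustration, not the format used in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class MeasurementMetadata:
    """Hypothetical metadata record for one recorded file."""
    campaign: str                       # e.g. "measurement A"
    file_name: str
    recorded_at: str                    # ISO 8601 date and time
    vehicle_id: str
    description: str
    streams: list[str]                  # e.g. ["Flexray", "Video", "ECU-X"]
    software_versions: dict[str, str]
    labels: list[str] = field(default_factory=list)  # events, sequences, ...


def downsample(samples, every_nth):
    """Keep every n-th (timestamp, value) sample to thin out a signal."""
    if every_nth < 1:
        raise ValueError("every_nth must be at least 1")
    return samples[::every_nth]


# All values below are invented for the example.
meta = MeasurementMetadata(
    campaign="measurement A",
    file_name="file_1",
    recorded_at="2021-03-01T09:30:00",
    vehicle_id="test-car-01",
    description="highway drive",
    streams=["Flexray", "Video", "ECU-X"],
    software_versions={"ECU-X": "1.4.2"},
    labels=["event:cut_in"],
)

# One second of a made-up 100 Hz velocity signal: (timestamp_s, value).
samples = [(i / 100, 30.0) for i in range(100)]
preview = downsample(samples, every_nth=10)
print(meta.file_name, len(samples), "->", len(preview), "samples")
```

Fixed-stride sampling is the simplest choice; a platform could instead keep only samples where the signal actually changes, which fits the "nothing changed within a second" case even better.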

Framework for Data Science/Analysis

This is the framework for data science and data analysis. If you're working on autonomous driving, you have different function teams or feature teams, and every feature team or function team has different requirements. For example, someone working in the perception team wants to analyze their data differently than someone working in the planning team, because they need different types of information. You need a modular workflow when you want to do function-specific or feature-specific data analysis. You can always have standard algorithms to select the data, filter the data, resize the data, bin the data, and so on. For plotting or visualization, you can use interactive charts, plots, and maps, so that you know where the data was recorded, which other signals are there, their distributions, and so on. In order to access the data in the data management platform more smartly and efficiently, you need a structured data management system with a smart search option. Here, there's a flow going from the bottom to the top. At the bottom is a test field car which records the data, as I said, 8 gigabytes per second for example. You ingest the data into the data management or data storage. Then you select the data. You define the data: what the data is, what its meta information is, and so on. Then you clean the data and prepare it accordingly. If you want to add more information from advanced modeling, like mathematical and physical models, or additional information from maps or from features, you can do that in this step. Then you append this information to the meta information. At this point you have reduced data, because you have already cleaned it and removed the anomalies and so on.
Then you transform this data into a model you can use within the team and interpret. Then you create new knowledge based on these datasets, which is really helpful for the function team and feature team in developing their functions efficiently and properly. This is a sample framework for data science or data analysis. One can think about it, or one can implement it.
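The standard building blocks mentioned in this framework, such as selecting, filtering, and binning (grouping values into ranges), can be sketched as small composable functions. The function names, the `velocity_kmh` signal, and the plausibility limits are hypothetical; a real platform would more likely use a dataframe library.

```python
from collections import Counter


def select(records, signal):
    """Pick one signal out of each record, skipping records that lack it."""
    return [r[signal] for r in records if signal in r]


def filter_range(values, low, high):
    """Drop values outside [low, high], for example sensor glitches."""
    return [v for v in values if low <= v <= high]


def bin_values(values, width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter(int(v // width) * width for v in values)


# Made-up records; the -5.0 stands in for an implausible reading.
records = [
    {"velocity_kmh": 32.0},
    {"velocity_kmh": 48.5},
    {"velocity_kmh": 51.0},
    {"velocity_kmh": -5.0},
    {"velocity_kmh": 118.0},
    {"steering_deg": 2.0},
]

velocities = filter_range(select(records, "velocity_kmh"), 0.0, 250.0)
histogram = bin_values(velocities, 50)
print(sorted(histogram.items()))
```

Keeping each step as its own function is what lets a perception team and a planning team reuse the same primitives while chaining them differently.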

Visualization of Statistical Results and Breakdowns for Evaluation

If you ask me what the visualization looks like, you can have different levels of visualization. For example, here on the left, you see the evaluation of driving campaigns. With driving campaigns, it's more like: this car is recording data on this highway, in this country, in this city, and so on. The information is huge. You have a map-style visualization, so that you can see which areas and which roads you have driven. On the right, you see a visualization of the results based on the scenario. There you go into more detail on the signals the recording campaigns are producing, like velocity, object list, and so on. Then you can visualize it differently. These are two possible examples of visualizations you can do in your platform.

Data Enrichment

Then, I would like to talk about data enrichment, because this is one of the important things you want to have as information within your data files, so that they are highly informative. You can generate many statistics out of it, and the feature, function, or product teams can use it to make meaningful test cases, test scenarios, development scenarios, and so on. Here, on the right, you see the basic flow: you record the data. You ingest the data into the data management platform. Then you can either visualize the data in your working space, for example on my laptop, or you can upload this data to the cloud and then connect to the cloud to visualize it. You can do multiple things there. Then, in the cloud or once the data is stored in the data management, you can add information: information about the vehicle, information about the test setup. Why was this data recorded, and for which customer: is it a parking project, a highway pilot project, an urban scenario project, and so on. Then you add this information. Then you also add metadata information, for example, semantic segmentation. You basically do all kinds of mapping and correlation for the data in this process. The top one is more about scenarios. You create different models, mathematical models or machine learning models, with which you generate scenarios from the data streams. Then you tag them so that you can use this information when you're doing function development or function testing in the end. Then you add this information together with your data files. You can then connect it to the scenario editor, where you can generate different variations of the scenarios, or you add this information to your testing results.
Here, I have plotted an increasing effort chart, because the more you go from vehicle setup and test setup towards scenarios, the more effort or the more advanced methodology or technical solution you need in order to extract meaningful information from these data files.
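As a small illustration of scenario tagging, the sketch below uses a simple rule (a mathematical model in the loosest sense) to find hard-braking intervals in an acceleration stream and stores them as tags next to the file's metadata. The `hard_braking` scenario, the threshold of -4.0 m/s², and the tag layout are assumptions for this example; the talk does not specify which scenarios or models are used.

```python
def tag_hard_braking(samples, threshold_mps2=-4.0):
    """Return (start_s, end_s) intervals where acceleration <= threshold.

    `samples` is a time-ordered list of (timestamp_s, acceleration_mps2).
    """
    intervals = []
    start = None
    prev_t = None
    for t, accel in samples:
        if accel <= threshold_mps2:
            if start is None:
                start = t          # a braking interval begins here
        elif start is not None:
            intervals.append((start, prev_t))  # interval ended at previous sample
            start = None
        prev_t = t
    if start is not None:
        intervals.append((start, prev_t))      # stream ended while still braking
    return intervals


# Made-up acceleration stream.
stream = [(0.0, 0.1), (0.1, -4.5), (0.2, -5.0), (0.3, -1.0), (0.4, -4.2)]

# Hypothetical metadata entry that gets enriched with scenario tags.
file_metadata = {"file_name": "file_1", "project": "highway pilot", "tags": []}
for start_s, end_s in tag_hard_braking(stream):
    file_metadata["tags"].append(
        {"scenario": "hard_braking", "start_s": start_s, "end_s": end_s}
    )

print(file_metadata["tags"])
```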

Finding Unknown Unknowns - Scene Grid Recurrent Neural Network

From here, I would like to talk about some of the methods one can use at the top, where you need the most effort, in order to gather scenario information from a data file when you don't know exactly what's happening just from the raw data stream. I would like to give a glimpse of two possible methods you can use. The first one is the scene grid recurrent neural network, which is one way of finding unknown unknowns in your data stream. Here, you have driving data. At each timestamp, you have information about your objects, traffic lights, traffic signs, lane markings, and so on. This is what normal driving data looks like. Then you plot this information onto a scene grid, also over time. Then you extract the spatial features using a convolutional neural network. From there you extract the temporal features using LSTMs and recurrent networks. Once you have a feature space, you can do different tasks. You can cluster the information and use the clusters for labeling. You can also use this feature space to detect anomalies within the data stream, so that you can clean the data, and so on. This is one of the ways you can find these unknown unknowns in your data stream, which can then be used in multiple ways.
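The first step of this method, plotting the objects of one timestamp onto a scene grid, can be sketched without any deep learning library. The sketch below assumes an ego-centred square grid with one channel per object kind; the grid size, cell size, coordinate convention, and object kinds are hypothetical. The convolutional and recurrent stages that would consume a sequence of such grids are deliberately left out.

```python
KINDS = ("vehicle", "traffic_light", "lane_marking")


def build_scene_grid(objects, size_m=40.0, cell_m=4.0, kinds=KINDS):
    """Rasterize one timestamp into grid[channel][row][col] with 0/1 cells.

    The ego vehicle sits at the grid centre; each object is given as
    (kind, x_m, y_m) with x forward and y to the left, in meters.
    """
    n = int(size_m / cell_m)
    grid = [[[0] * n for _ in range(n)] for _ in kinds]
    half = size_m / 2
    for kind, x, y in objects:
        if kind not in kinds:
            continue  # unknown object kinds are ignored
        if not (-half <= x < half and -half <= y < half):
            continue  # object lies outside the grid
        row = int((x + half) // cell_m)
        col = int((y + half) // cell_m)
        grid[kinds.index(kind)][row][col] = 1
    return grid


# Made-up objects for one timestamp; the last one is outside the grid.
objects_t0 = [
    ("vehicle", 10.0, -3.0),
    ("traffic_light", 18.0, 5.0),
    ("vehicle", 60.0, 0.0),
]
grid_t0 = build_scene_grid(objects_t0)

# A recording becomes a sequence of grids, one per timestamp.
sequence = [grid_t0]
print(len(sequence), "grid(s) with", len(grid_t0), "channels of",
      len(grid_t0[0]), "x", len(grid_t0[0][0]), "cells")
```

Stacking such grids over consecutive timestamps gives the image-like sequence that a convolutional network followed by an LSTM could process.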

Finding Unknown Unknowns from Embeddings

The next one I wanted to talk about is finding unknown unknowns from embeddings. Here, you have an individual information channel for each driving state. Then you create a permutation of these binned channels, which resolves to a driving state. That is your input, and you produce a one-hot encoding of those inputs. Then you have an input layer, a hidden layer, and an output layer. You extract the information about the driving states from the embeddings of the hidden layer. For example, here you see that if two driving states occur in the same context, the distance between them will be really small. If the driving states do not occur in the same context, the distance will be really large. This way, you can find meaningful information from the embeddings of the intermediate or hidden layer, to tag and to learn about the scenarios within your data streams. Those are the two example methods. We are using them in one of our data analytics platforms in order to find the unknown unknowns.
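The intuition that states occurring in the same context end up close together can be shown with a much simpler stand-in than a trained network. The sketch below builds driving states from combinations of two binned channels, one-hot encodes them, and then uses plain co-occurrence counts in place of learned hidden-layer embeddings. The channels, bins, and state sequence are invented, and the co-occurrence vectors are a simplification, not the method from the talk.

```python
import math
from itertools import product

# Hypothetical binned channels; each combination is one driving state.
SPEED_BINS = ("slow", "fast")
LANE_BINS = ("keep", "change")
VOCAB = list(product(SPEED_BINS, LANE_BINS))


def one_hot(index, size):
    """Encode a state index as a vector with a single 1."""
    vec = [0] * size
    vec[index] = 1
    return vec


def context_vectors(sequence, vocab, window=1):
    """Count which states occur within `window` steps of each state."""
    index = {state: i for i, state in enumerate(vocab)}
    vectors = {state: [0] * len(vocab) for state in vocab}
    for i, state in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[state][index[sequence[j]]] += 1
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the two contexts are identical."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


A, B = ("fast", "keep"), ("slow", "change")
C, D = ("fast", "change"), ("slow", "keep")

# Made-up recording: B and C both appear between A and D.
sequence = [A, B, D, A, C, D, A]
vectors = context_vectors(sequence, VOCAB)

print("one-hot of B:", one_hot(VOCAB.index(B), len(VOCAB)))
print("distance B-C:", round(cosine_distance(vectors[B], vectors[C]), 3))
print("distance B-D:", round(cosine_distance(vectors[B], vectors[D]), 3))
```

In this toy recording the two lane-change states share the same neighbours, so their distance is near zero, while a state with different neighbours lies further away. A trained network with a hidden layer would learn denser vectors, but the distance-based reading of the result is the same.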

Cloud Workplace (Data Science, Desktop Dashboards)

I just wanted to show you what a cloud workspace and its data science dashboards look like from a visual perspective. You have a Jupyter Notebook where you can extract data and do some analysis on it: when the data was recorded, what the location was, how big the data is, and so on. Then you can create all the visualizations, plots, and statistics: how many timestamps you have in your data, what the distribution looks like, where you have edge cases and corner cases, and in which part of the world the data was recorded. This is just a visual of what the data science or data analytics platform looks like on a workplace basis.


I have talked about the autonomous driving world and its different levels. Then I talked about what data-driven development is, its challenges, and so on. In the end, I focused my talk on the data analytics platform, which I personally believe is the core, because if you don't know how to efficiently or smartly access your data for your functions, then you will always be stuck with a small-scale prototype solution. For scalability, it is very important to have a good data analytics platform within your organization or within your team. With this, I wanted to give examples of the different levels of information you can have, how you can plot, and how you can do certain tasks within this platform.




Recorded at:

Jul 11, 2021