InfoQ Homepage Presentations Designing IoT Data Pipelines for Deep Observability

Designing IoT Data Pipelines for Deep Observability

Bookmarks

View Presentation

Speed:

31:16

Summary

Shrijeet Paliwal discusses how Tesla deals with large data ingestion and processing, the challenges with IoT data collecting and processing, and how to deal with them.

Bio

Shrijeet Paliwal is a founding member of the data platform team at Tesla, tasked to build a scalable, self-served, cost-efficient data platform that caters everything data at Tesla. He has his hands all the way from improving server provisioning & automation to writing multi-petabyte time series storage. Prior to Tesla, Shrijeet has contributed to data infrastructure at Pinterest and Rocket Fuel.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Shrijeet: I will introduce you to some cool data and flow challenges that you only see in IoT space. I'll start with a small problem that you might relate to. Tesla designed a supercharger network to enable a seamless, enjoyable road trip experience. It can be quite frustrating to drive to a supercharger only to notice that it's down. We provide real-time visibility into supercharger's health status. We receive telemetry from our superchargers that let us know when a post is occupied, and various other stats. Using this data, we can assess the health of our supercharger. On the surface, it's a tractable problem. There are many edge cases which may get novel. Some issues such as a broken connector, or an obstructed parking stall will prevent the vehicle from charging through that supercharger post. In some cases, a supercharger may be perfectly healthy, but due to extreme weather or some other reason, it might lose connectivity with the cloud and stop sending telemetry. The down detection algorithm, the algorithm which figures out where to route a user to has to employ many heuristics. It has to not only be ready for the superchargers, which are sending data in a perfect way, but it has to also account for the fact that data can be missing many times. While any algorithm which depends on data has to be prepared for missing data, but in case of IoT, it's not a rare event, it's actually quite common. It's problems like these, which makes IoT data space very unique to work with.

Background

My name is Shrijeet. I am part of the core data platform team at Tesla. My team is responsible for handling all mission critical data systems at Tesla. I'm here to tell you more about these unique challenges and some unique solutions.

IoT Data Pipeline from 10,000 Feet

Let's start with a 10,000 feet view of IoT data pipeline. Most of the devices, at least in case of Tesla, run a Linux operating system and provide local data storage, computation, and control while maintaining bidirectional communication with the cloud. The communication flows through an edge gateway that handles connectivity as well as security. The edge gateway integrates with the cloud, which has a Kafka cluster behind it to ingest large volumes of telemetry. The Kafka based Pub/Sub provides message durability, it decouples producers of data from the consumers of data, and allows sharing this telemetry across many downstream services.

Not Clickstream

At first glance, the picture that we just saw, it seems very similar to any other data pipeline we can imagine. The very first comparison that comes to mind is clickstream, data that we collect from ad networks, the Google's and the Facebook's of the world. We have come a long way in defining elegant design patterns for dealing with massive scale clickstream data that we observe with conventional internet systems. These design patterns work really well, because the underlying data producer is stable, and it is built with a common standard set of assumptions. You expect the network to be always reliable. You expect the traffic patterns to be more or less the same, and even if they change, even if there is a seasonality, after a period of time, even that can be predicted. Essentially, you trust the upstream to always behave well. The way you make sure that they behave well is putting multiple layers of redundancy. A conventional software service is deployed in data center, which is responsible to make sure this redundancy is there.

Uncertain, Erroneous, and Noisy

In contrast, for the most part, IoT devices are installed in a very hostile environment. In IoT, data acquisition can go partially dark for legitimate reasons, and thus, knowing whether it's a real problem is very subjective. IoT devices generally suffer from lack of resources: power, storage. Their computational and storage capabilities do not allow complex operations such as sanitization, putting a standard format, deduplication, and things like that. Finally, intermittent loss of connection in case of IoT with the cloud is relatively frequent.

Unbounded Heterogeneity

IoT devices are also incredibly heterogeneous, regarding basic communication protocols, data formats, and technologies. Furthermore, in case of IoT, if we divide data in two categories, one is diagnostic data, and the other category is something that is critical for the functionality of the device, the diagnostic data is always collected in best effort mode. Devices will choose to stop sending data if it comes in the way of providing critical functionality, for obvious reasons.

IoT Data - Design Patterns

With its unique challenges, IoT data engineering has modest overlap with conventional approaches. Over the years, we have developed a few targeted patterns to undertake these challenges. Let's look into some of those patterns.

Isolated Pipelines

The very first design pattern we use is isolation. We can find data pipelines along multiple dimensions. The obvious dimensions are business units. In the case of Tesla, we have end-to-end data flow for every business unit. The underlying infrastructure on which these data flows run might be shared. For example, we might share a Kafka cluster. We might share a Kubernetes setup. We might even share some databases at infrastructure level. At the application level, they are completely isolated. The next dimension, the obvious dimension to divide after business unit is products. If we have multiple product models, for example, if you have multiple models of vehicles, they all will have their own end-to-end data flows. We'll not mix data from Model A with Model B. The next dimension at which we all isolate is geography. Those were some of the obvious dimensions. Then there are some interesting dimensions such as transmission modes. Devices collect data periodically, and that's an obvious transmission mode where they're periodically sending some payload to the cloud. There are cases when you would want out-of-band collection of data. For example, a service technician wanting to know more about your device, and they can request an ad hoc pull, which can be more verbose. That's another place that we would not want periodic collection to interfere with out-of-band collection. While this strategy of isolating pipelines and data flows across multiple dimensions, allows us to limit the noise and allows us to reduce the problems of noisy neighbors and subsequently improve the data quality, the downside is that we have too many datasets and too many data pipelines to manage.

Deferred Complexity

The next commonly used pattern is deferred complexity. As noted, due to constrained resources, the devices cannot handle operations such as sanitization, putting some standard format. I keep saying standard format, because the data in many cases is proprietary. Deduplication is another thing that devices will never perform. They will always defer all these complex operations to the cloud. The device is always operating in a mode where it can push as much data as possible, as fast as it can, and in the tiniest possible payload that it can manage to send that data. On the cloud platform side, to restrain, to contain the heterogeneity and burstiness and noisiness of data, we try to explode the payloads that we receive, which are typically batched payloads. We get telemetry in batches. We try to explode regardless of which business unit or product this data is coming from. We try to employ some common strategies so that we can put sanitization, common standard formats, and some other standard cleaning steps in the first step of data collection itself. This slide is a demonstration of that as you zoom in into the Kubernetes magical black box that you're seeing there, that is running some real-time data pipelines that we have developed over a framework called Akka. We call this whole product, Kafka channels. In there, we apply these complex operations as the first thing as we receive these payloads. When you zoom in further, in any given sanitization or cleaning data flow, sub-flow, you will find more sub-flows. This pattern results in data processing broken down into multiple intermediate datasets and sub-pipelines on the cloud platform end. We already had too many datasets and too many pipelines to manage because of the isolation that we were doing. Now due to this breaking down, data flow into sub-flows, is increasing these datasets and data pipelines even further.

A Fresh Set of Pain Points

Our design choices and tooling has enabled us to digest the explosive growth that we have seen. The patterns that I just described were instrumental in making sure that we are able to keep up with the growth. The ease with which we can set up a new pipeline today was biting us back, by making it next to impossible to keep a clean model in our head about data pipelines. We had a fresh set of challenges around operating these pipelines, and how visible the health of these pipelines were. Let's look into those fresh set of challenges.

Data Discovery Needs Heroics

With the explosive growth, datasets under management kept growing, like I said, and the common way to organize that information is documentation. The growth we were seeing, documentation was not keeping up. The other common thing that you see in any data platform, in any company, any data org is you have data heroes, data champions, which are your go-to guys for questions when you cannot figure out where the data is or where a particular piece of information is. We also had them. Obviously, that's bad. That doesn't scale. Beyond the number of datasets, and the isolation pattern that we described, it was actually worsening the problem further. For instance, a device has many sensors, and some of those sensors may be perfectly fine. They're sending data.

We are collecting data end-to-end without any problem. There could be some sensors, or there could be a subset of sensors which have stopped sending data. That is also not a rare scenario. In the context of data discovery, a user cannot just ask, where is the data for device x? Or, where is the data for a certain category of devices? A more pointed question, a more common question is, where is the data for device x of kind Y? Then you throw a dimension of region, for instance, where is the data for device x of kind y, from reason z? Then you throw more dimensions, and you can see how data discovery gets more complicated.

Data Governance, a Formality

The next problem we were seeing was governance. We required since day one, developers to apply datasets and data pipeline changes through pull requests, which was great. The build process could apply some elementary checks, whether syntax is correct, whether you're putting configuration at the right place or not. However, we relied on the human, the person who's looking at the code, reviewing the code to spot data governance gaps, such as regulatory requirements, data security concerns, and naming conventions. Over time, such processes add toil and they are reduced to mere formality. What we wanted was we wanted the build to do all these semantic checks as well.

Is Data Available and Complete?

Next, uncertainty, erroneousness, and noisiness is given with IoT data. We never expect 100% coverage at any given point of time. A device sensor for the same device, some sensors could be sending data while others might fail. The coverage tells you that you might have some data available, but is it 100% or not, is what you're trying to get from the coverage question. We had some handcrafted metrics that would give you a sense of data quality, data coverage for some products, but we didn't have anything that generalizes well across all business units and products. That was very important, over a period of time that we can confidently give an assessment of what is the coverage of our datasets.

How fresh is the Data?

For a real-time data pipeline, or even batch pipeline, data freshness represents the most important service quality metric that you can have. For as long as I can remember, freshness was the single metric that we alerted on most accurately. It is simple, yet effective for all parties involved. Data producers understand it. Data consumers understand it. Data scientists understand it.

Everyone understands it. After a period of time once you've figured out the nuances, it's actually pretty easy to calculate. We had some great tooling to calculate latency for individual steps. However, for someone at the far end of the spectrum, like a data scientist, for them, the data flow starts at the device and it ends at some dataset that they want to query. Anything in between is implementation detail for them. They do not care how long it took for any service step, what they care about is, what was the end-to-end latency? We didn't have a way to expose that. Something that we were really proud of, we thought our service quality indicator in terms of freshness was great. They were actually not that great.

The Solution - DataFlow Toolchain

We covered some problems. We covered problems related to coverage. We covered the problem related to freshness. We didn't have a good story for governance. We didn't have a good story for data discovery. We decided to solve this problem through a toolchain that we now call DataFlow.

A Controlled Vocabulary for Labeling Datasets

Data discovery is not a new problem. You might have that in your team, and you might be using some existing tool or infrastructure to solve data discovery problem. The most basic solution for this data discovery problem is to have a catalog, where your datasets are tagged with some labels. When you think about tags, you have two choices. You can have free text tags, where the user can go do some UI or through some configuration and they can apply any logical tag or any combination of tags that you can imagine, that is free text tags. The second choice that you have is you restrict the tags to our taxonomy, we chose B. We restricted the allowed tags to our taxonomy that we maintained in source control. The reason we did that is that that gave us some consistency.

DataFlow Taxonomy

As a platform engineer, we had little control over how owners curated these datasets. Beyond laying best practices, we chose not to dictate their names, schemas, as that would restrict the expressiveness of the data owner, which is not great. We have an example here. Imagine you have a company which has two business domains, one is dealing with cameras, there's a thermostat business domain, and there are a few regions in which they operate. Each business unit has some product models that they're selling. Typically, in an IoT infrastructure, you divide the deployed devices into few categories. You can have customer category of devices. You can have prototype or engineering category, so that you can test your devices. Then, I explained the concept of publishing mode or trigger mode, periodic or out-of-band. That can be another dimension at which you want to tag datasets at. Finally, you can have data types. Although this is a made-up example of taxonomy, but the structure is pretty real. That is what we do at Tesla. We allow business units to have flexibility in the set of values that allow in certain dimension. You can have any number of products you want, and you choose to call products, whatever you want to call them, obviously. Each business unit has to have all these dimensions, they are common. Everyone has to have region, product, fleet, triggers, and data types, and a combination of these. This is naturally hierarchical. If you take these files and put them into a single data structure, you get a tree. What is the path in the tree?

The Logical Dataset

A path in this hierarchical taxonomy represents a label, so you can take any path in the tree, as long as that's a valid path in the tree, we allow that to be our valid label. Imagine that there is a dataset which represents data for a certain model of camera in the engineering fleet and data is being pulled manually in that dataset, and the data type in the dataset is of telemetry kind. Then we can use that example taxonomy that I just showed you, and we can slap this label in. This dataset can physically rest in multiple datasets. You can have some Kafka topics in which this dataset is present. You can have a random access database in which dataset is present. You can have some batch database in which dataset is present. Logically, they are all the same, and so you can slap a single label, and that would be your logical dataset.

A Unified Abstraction for Lineage and Catalog

We have defined the concept of taxonomy. We have defined the concept of logical dataset. We're getting towards a solution for data catalog, which is great. All we need is, given a label, we need to figure out what all datasets are covered in that dataset, which is easy. The other problem we had was around operability and visibility. How do we deal with the number of pipelines that we had? How do we figure out what is the end-to-end latencies? We needed some data lineage abstraction. We wanted to capture the flow as data is moving from one dataset to other datasets, all the steps involved in some abstraction, and that is data lineage abstraction. We decided it would be prudent if you can combine data catalog and data lineage into one single abstraction. That abstraction was graph.

DataFlow Graph

We came up with the concept of DataFlow graph. DataFlow for a device sensor begins at that sensor as the origin and ends at a dataset in one or more datastores. We break down the flow into multiple intermediate datasets and sub-flows. Data lineage captures this flow from its source to its destination, via various changes and hops on its way. How the data gets transformed along the way, and how it splits, converges after a stage. Naturally, it is best represented as a graph. If it is a graph, what are the vertices and edges? Each physical dataset, for example, a Kafka topic in some proprietary format, or a Kafka topic for the same proprietary format in canonical format, like a standard format, in our case, that is Avro. Then same data in a time-series database, and same dataset in a batch database, they all become dataset or vertices in this graph, and they're all labeled with the same taxonomy label. The structure of dataset in this slide has a name, a datastore in which it lives, then it has a collection of labels. Finally, it has a tier. The concept of tier is used in our case to organize our alerts.

DataFlow Transformation as Edge

We have defined how we capture vertices, let's talk about edges. We represent intermediate steps as DataFlow steps. These steps are Kafka streams, or batch jobs, are powered by Spark or MapReduce. A single DataFlow step can consume multiple datasets and produce multiple datasets. Hence, it can translate into multiple edges in the graph. We represent edge as data flow transformation. DataFlow transformation has the step which produced this transformation, and then it has the tuple, the source and target datasets which are involved in this DataFlow transformation. We have defined what it means for vertex to be in the graph, and what is an edge in this graph, but how do we put the graph together?

In any mature data platform, after a period of time, you will find there are multiple frameworks involved. You might have some streaming technology, you might have some batch frameworks involved. You will also have multiple storage technologies involved. In our case, we have Kafka, we have HBase, we have HDFS, and the configuration. How do you create a new topic? How do you create a new table in time-series database? How do you create a new table in HBase? How do you create a new table in HDFS? It's entirely possible for those things to live in various different ways. You can have a team which is completely operating in cowboy mode, where they go to a terminal and just create a table there, or you can have a more disciplined team which is writing tools, which can read configuration and create these datasets and data pipelines for you. Even if you have everything in the configuration file, then this information can be spread into multiple repositories. The problem with having the staggered information is that it's very hard to create what I just described, because you need to pull information from multiple sources and put into one place.

ETL-ing the Graph

Luckily, in our case, even before we started with DataFlow, one of the good things that we have been doing is every change goes through source control so we didn't have a problem of someone doing something random directly in production. We just had one repository. We use a monorepository that we built using a build tool called Bazel. Bazel organizes build rules in a simple configuration where you can feed in the code source, and then you can change the dependency graph. This example is a very high level example of how we actually do it. We define a file group called dataset in which we feed all the configuration files, which configure dataset into all different kinds of databases that we have: Kafka topics, HBase tables, HDFS tables. Then we have a file group, which captures all the data pipelines. We have the real-time data pipelines that we operate through a framework that we built called Kafka channels. You can also have MapReduce jobs and your Spark jobs in this file group.

Then both those are fed into the final rule, which is the graph generating rule as a dependency. If anything changes, the final rule will retrigger and it will regenerate the graph. Bazel captures this dependency and change detection very cleanly. Another thing to note is we are not storing our catalog and lineage into any fancy database. Even with thousands of datasets and thousands of pipelines, this information when serialized as data structures can be stored in kilobytes. What we have is a binary representation of our catalog and data lineage that we can supply to any downstream tool or any downstream service that wants it. It regenerates for every pull request, so that we can now apply strong semantic checks as well, which was one of the problems that we had. We can verify that there is a dataset A and dataset B, and they should never be connected in any scenarios, because dataset A can have some private data that we do not want to make it to an openly accessible dataset B. That check can be put into the build stage.

Domain Agnostic Data Quality Indicators

Taxonomy and graph gave us a foundation for operating the elementary pipelines. Next, look at a building block, which gives us some hold on visibility. The common gist of visibility is that you should be able to monitor the health of a complex system by just looking into inputs and outputs of the system. We describe that freshness as a very strong indicator for figuring out the service quality of data pipelines, but there are problems such as coverage and completeness, which go beyond freshness. How did we solve them?

DataFlow Trace

Measuring data quality requires looking at data in aggregate and identifying patterns rather than individual events. The rate of individual events that we see is about 30 million events per second that is a peak rate that translates to trillions of events per day. If we had to go and infer data quality, it's perfectly doable, but that is very expensive in terms of cost, competition costs, and it is prone to lag. If we take a step back and think from first principle, what is needed to, for example, derive coverage. For every payload that we receive from the device, if we can record the fact that we received the payload at a certain time. Next, if we can associate that payload with a device ID. Finally, if we can associate and record where we actually received that payload. With these three pieces of information, we can actually do very interesting things.

Let's take all of these things and capture that as a meta record call, trace record. That's what we do. We have a concept of DataFlow trace. A blessing in disguise, in case of IoT, data always comes in payload. The payload rate is actually not a lot, maybe a billion per day. When we explode those payloads, when we process those payloads, that's when we get into 30 million per second throughput. We record traces at raw payload level, and we store those traces. Every payload that moves, the very first entry point record that we receive something, this dataset, this time, for this device, and we take that, and we store that as a separate meta table. By analyzing that meta table, we can do very interesting things, for example, coverage.

DataFlow Coverage

This is a real dashboard, which gives us a sense of coverage for any given product and transmission mode across all business units. It's a very general solution. For every transmission mode, whether it's pseudonymous, urgent, or manual, in our case, we know the number of devices which are expected to send data. There are about 790K devices which should be sending us data. Then for a fixed period of time, daily, weekly, and monthly, we know what devices are actually sending us data through tracing. Then once we have these pieces of information, we can calculate coverage and we can set up monitoring on that. In certain cases, we want coverage to be very high. In some cases, we want coverage to be low. For example, we do not want urgent DataFlow to see too much activity because that means something is wrong.

End-to-End Freshness

Next, end-to-end freshness. We always had very good tools for capturing latency of individual steps in DataFlow. By putting all those datasets and data pipelines as vertices and edges in the graph, it's very easy to calculate end-to-end freshness, because that reduces to a graph problem. Given a dataset, as of today, we can figure out all the datastores in which this data lives and all the steps which are involved in making sure that data is making there. We calculate the latencies of that. We produce this neat dashboard and also output end-to-end latency so that users don't have to worry about, but in case they do, they also have this.

Conclusion

IoT data is quite unique in many ways. Although you can apply the patterns that we apply in conventional systems, but there will be some edge cases which require first principle based thinking. Once you apply those, once you sit back and look at the problem holistically, you can come up with very simple solutions, which give you a very good operational and observability hold on these data pipelines.

See more presentations with transcripts

Recorded at:

Jun 25, 2021

Shrijeet Paliwal

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?