Tesla's Virtual Power Plant



The speakers explore the architecture of the Tesla Energy Platform including the use of asset hierarchies, functional programming techniques, trade-offs in edge vs. cloud computing.


Natalie DellaMaria is Senior Distributed Systems Engineer @Tesla Energy Cloud Platform. Hector Veiga Ortiz is Staff Distributed Systems Engineer @Tesla Energy Cloud Platform.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


DellaMaria: Recent extreme weather events such as Hurricane Ian, a category 4 hurricane, and the heat waves of California that broke hundreds of records, result in devastating power outages that lead to economic losses and harm to human well-being. In the case of Hurricane Ian, we have damage to the physical infrastructure, which inhibits the capability of the grid to generate and transmit electricity. In the case of the California heat wave, we have excess demand from residents, businesses, and critical operations that require additional electricity to run cooling systems, air conditioning, and other operations to keep people safe and operations running smoothly. This causes stress on the grid. Since the grid can't store energy, this electricity has to be provided in real time to meet these loads. What this looks like for grid operators is this graph right here. This graph could represent a day of the heat wave for the California grid operator. The blue line represents the grid capacity and the orange line represents forecasted demand. We start to get into issues when the graph looks like this, with forecasted demand surpassing the capacity of the grid. This is when grid operators start to initiate rolling blackouts, as was the case with the latest California heat wave. Some of you that live in California might even remember receiving daily statewide alerts, urging you to reduce your energy consumption because we were on the brink of these blackouts. The reason grid operators do this is that they need to distribute and balance the available load and minimize impact to consumers, but, unfortunately, it leads to these rotating blackouts. We can see with these recent weather events the importance of having energy independence and energy security. We're going to talk about how Tesla utilizes distributed energy resources to ensure our customers individually and collectively have energy security.


I'm Natalie DellaMaria.

Ortiz: I am Hector Veiga. We're going to explain the benefits of having residential batteries, and how we can extend their utility using software. Then we're going to explain how the Tesla Energy platform can increase energy security for a single home. Then, how we can use the same platform to orchestrate fleets of batteries to increase energy security for entire communities. Finally, we will show how to have an integrated user experience where we can give customers control during these critical times.

Why Batteries?

Why Batteries? Traditional power generation has come from burning fossil fuels. That is a source of energy that is reliable and consistent. Renewable sources such as solar or wind are more inconsistent and variable. You may get a lot of sunlight and wind and generate energy when it's not really needed, or you don't have enough when it's needed the most. That's where batteries and software really come in. When we couple batteries and renewable energy, we can store the excess energy in those batteries to be used later and reduce the need to burn fossil fuels. A good example is the Kapolei Energy Storage facility in Hawaii. This energy installation is capable of storing the amount of energy that can power 700,000 homes for an hour. This is also impacting the island by removing the need for peaking power plants. Peaking power plants, commonly known as peaker plants, are power plants that only run when there is high energy demand, normally burning fossil fuels, and due to this irregular use, the energy generated from them is actually very expensive. Residential batteries are also great. They help our customers to control their solar energy generation. They provide backup in case of power outages. Where they really shine is when you couple them with software.

Why Software?

DellaMaria: Then, why software? We could run software on the edge on devices that utilize information local to the device such as solar production, or historical home load. We also run software on the cloud to relay other information to enable this autonomous decision making on the device. We can also utilize software to create integrated user experiences across products and things such as the mobile app. We can also utilize cloud software to aggregate fleets of individual devices, to expand the impact from an individual battery impact to community impact in what's known as a virtual power plant. We can think of simple examples of how we use software to create good customer experiences with their batteries. This is an example of a customer updating their backup reserve, and this is the value that the battery will store in case of a grid outage. There are times when we want more complex features. Most of our customers want to keep a low backup reserve most of the time. This enables them to have capacity to store their own solar. However, when there's a high likelihood of a grid outage, they'd like to be able to adjust their backup reserve to full, so the battery charges to full and they have a full battery in case the grid goes out. Storms are the leading cause of grid outages, and no one likes to manually watch the weather channel waiting to update their backup reserve. Thankfully, Tesla will do it for you. When you enable StormWatch on your device, it will preemptively charge your battery to full ahead of inclement weather. This way you can leave your backup reserve value low and you don't have to worry about not having enough energy in your battery to power your home when the grid goes out.

The Tesla Energy Platform

Ortiz: First, let's speak about our platform and how it powers programs like StormWatch. Tesla devices such as batteries are connected to the cloud using secure WebSockets. The connection is bidirectional. From the edge to the cloud, devices can report their alerts and their measurements to inform about their current state. The set of services in the platform that receive, handle, store, and provide these data points are under the telemetry domain. Then, we need to give context to this data. That's where asset services come in. Asset services basically keep the relationships between devices, customers, and installations. Having accurate asset data is critical to power programs such as StormWatch or the virtual power plant. Finally, from the cloud to the edge, we have another set of services that can command and control those devices. The actions that we can send to devices can be things like firmware updates, configuration changes, or device mode transitions, such as charge the battery or discharge the battery. Messages exchanged between the cloud and the edge are formatted using Google protocol buffers. Google protocol buffers, or protobuf, are a language agnostic format. They're strongly typed and performant. They allow for schema evolution as long as you follow a few simple restrictions. Since they are contract based, we have a shared repository where we keep those protobuf files, and both edge and cloud developers can propose additions and changes.

The three domains in the platform are actually formed by thousands of services, but the bulk of them are written using Akka and Scala. Akka is a toolkit based on the actor model written in Scala. Scala is a functional programming language that runs on the Java Virtual Machine. We use Akka in many aspects of our development and operations. For example, when we need to scale up our applications, we rely on the Akka cluster and Akka cluster sharding modules. It also helps with our stateful applications, which we have a few, through the Akka persistence module where it helps you save the state and restart it when needed. We heavily utilize also their streaming tools, Akka Streams and Alpakka, because they provide built-in backpressure, backoff retry capabilities, and external connectors to reliably interact with other technologies. The Akka ecosystem has proven to be performant and reliable enough to meet our needs.

Let's talk about a scenario that we need to deal with where we use some of these technologies. Normally, our devices are connected to the internet when they report their data. Sometimes there's an internet connectivity outage due to an internet service provider problem or a spotty Wi-Fi connection. In those cases, the devices keep collecting measurements, and they basically buffer them locally. When the connection is restored, they send all the data to our servers, generating more load than expected. To handle this scenario, we use a combination of tools. We use Apache Kafka to store this data temporarily in a distributed topic. Then we rely on the Alpakka Kafka connector to consume this data. Because we're using the Akka Streams built-in backpressure, the consuming services do not consume all this data immediately, but they do their best to consume as fast as possible. However, that results in increasing lag on our topic. To speed up this process, we use Kubernetes Horizontal Pod Autoscalers, or HPAs. An HPA monitors a metric, in this case lag, and if it approaches a predefined threshold, it creates more pods to help decrease the lag as soon as possible. Once we have consumed all the lag, the HPAs get rid of the unnecessary pods and we go back to normal operation.
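As a rough illustration of the scaling step, the sketch below applies the replica formula given in the Kubernetes Horizontal Pod Autoscaler documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The lag values, the target, and the replica bounds are made-up numbers for illustration, not Tesla's configuration, and the real autoscaler also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_lag: float, target_lag: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Replica count from the documented HPA ratio, clamped to the configured bounds."""
    if target_lag <= 0:
        raise ValueError("target_lag must be positive")
    raw = math.ceil(current_replicas * current_lag / target_lag)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical reconnect burst: lag per pod is 5x the target, so 4 pods become 20.
print(desired_replicas(4, current_lag=50_000, target_lag=10_000))
# Hypothetical drained backlog: lag is a tenth of the target, so 20 pods shrink to 2.
print(desired_replicas(20, current_lag=1_000, target_lag=10_000))
```

The clamp matters during large reconnect storms: without an upper bound, a burst of buffered data could request far more pods than the cluster or the topic's partition count can use.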

The services in the platform are exposed through a set of standard APIs. We have two flavors. We do expose services using REST, the common REST architecture using HTTP/1.1, but we have heavily invested in gRPC, which in our experience offers some advantages over the traditional REST architecture. The main one is that services and messages are contract based, defined by protobuf. That makes sure that your request and response messages are exactly what you're expecting. Also, thanks to this contract-based solution, there are tools to generate code for both client and server, which helps us move faster and not repeat ourselves by spending time implementing that code. Also, since we run gRPC over HTTP/2, we make better use of connections, because HTTP/2 implements streaming and multiplexing. It is through these APIs that the platform allows other systems to provide higher order functions, such as metric dashboards. It powers UIs like the Tesla mobile app, or the programs that run on top of the platform, like StormWatch or the virtual power plant.

DellaMaria: Sometimes our platform services need to interact with third parties, as is the case with StormWatch. We rely on third parties to provide us with weather alerts that let us know which areas will be affected by inclement weather. These areas are designated either by a list of regions or by polygons that can consist of hundreds of latitude and longitude points. We've abstracted away the processing of these incoming alerts to an internal service we call Weather API. Having this internal layer decouples the rest of our platform services from our external data providers. This means our data providers can be updated, changed, added, or removed, and we have the flexibility to do all that without affecting the clients in our platform services. They maintain a consistent data integration through our Weather API. Another key functionality of the Weather API is transforming these various forms of geographic references into a standard format. We use the WKT or Well-Known Text format, which is just a string representation of a geospatial object. What this might look like is this, for the hurricane warning that we saw earlier. This is an example of a WKT polygon. From the context of our StormWatch service, we are constantly polling Weather API for these active alerts.
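The WKT polygon shown on the slide is not part of this transcript. As a minimal sketch of what such a string looks like, the helpers below serialize and parse a single-ring WKT POLYGON. The function names and the coordinates are hypothetical, and the parser deliberately ignores holes and Z/M values that full WKT allows.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), the x/y order WKT uses


def to_wkt_polygon(ring: List[Point]) -> str:
    """Serialize one outer ring as a WKT POLYGON, closing the ring if needed."""
    if len(ring) < 3:
        raise ValueError("a polygon needs at least three points")
    closed = ring if ring[0] == ring[-1] else ring + [ring[0]]
    coords = ", ".join(f"{x} {y}" for x, y in closed)
    return f"POLYGON (({coords}))"


def from_wkt_polygon(wkt: str) -> List[Point]:
    """Parse a single-ring WKT POLYGON back into a list of points."""
    text = wkt.strip()
    if not text.upper().startswith("POLYGON"):
        raise ValueError("not a WKT polygon")
    inner = text[text.index("((") + 2 : text.rindex("))")]
    points: List[Point] = []
    for pair in inner.split(","):
        x, y = pair.split()
        points.append((float(x), float(y)))
    return points


# Made-up alert area; a real alert can carry hundreds of points.
area = [(-82.5, 26.0), (-81.0, 26.0), (-81.0, 27.5), (-82.5, 27.5)]
wkt = to_wkt_polygon(area)
print(wkt)
print(from_wkt_polygon(wkt))
```

Because WKT is plain text, it can cross service boundaries in any message format, which is one reason it works well as the standard output of a translation layer such as the Weather API described above.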

StormWatch Architecture

I'm going to talk a little bit about the architecture of our StormWatch service. It's running on multiple pods in Kubernetes that form a single service. We model individual customer batteries as actors. Actors are objects that contain some state and communicate with asynchronous message passing. These battery installation actors are distributed across the pods in our clustered application. We utilize the Akka framework here to support our actor system. It abstracts away the need for us to create and handle our own locks in order to prevent concurrency issues and race conditions when reading and writing to the state of these distributed actors. It also supports location transparency. This is the concept that the sender of a message to one of these individual entities doesn't need to know the physical location of that actor or what pod it is running on. This is really helpful.

The way we poll for weather alerts is with a singleton actor, meaning there's only one instance across all of the pods, running on a single pod in the cluster. This is because we really only need to get the alerts once and can then distribute them amongst the affected actors. It's really important that our actors can handle alert duplications, updates, creations, and cancellations. This requires individual battery actors to maintain some state. Again, this is very important because there are downstream impacts in the real world, such as notifications and battery control plans. Going back to our Hurricane Ian story, we've detected an alert that's been picked up in our Weather API. It's constructed a standardized format for any of our services to consume. Our StormWatch service then picks it up in its next polling cycle. The next step is to determine which batteries are actually going to be affected by this weather alert. To do that we utilize optimized geosearch queries in our asset service. Our asset service is backed by a Postgres database and uses PostGIS, which is a geospatial database extender.
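To make the state requirement concrete, here is a small sketch of how a per-battery entity could stay idempotent when the same alert is delivered more than once. It is plain Python rather than an Akka actor, and the `Alert` fields (including the `version` counter) and the action strings are assumptions made for illustration, not the platform's actual message schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    alert_id: str
    version: int             # hypothetical: increases when the provider revises an alert
    cancelled: bool = False


@dataclass
class BatteryAlertState:
    """State one battery entity might keep so repeated deliveries cause no repeated side effects."""
    active: Dict[str, int] = field(default_factory=dict)  # alert_id -> last version seen
    actions: List[str] = field(default_factory=list)      # side effects, recorded for the demo

    def handle(self, alert: Alert) -> None:
        seen = self.active.get(alert.alert_id)
        if alert.cancelled:
            if seen is not None:
                del self.active[alert.alert_id]
                self.actions.append(f"cancel:{alert.alert_id}")
            return
        if seen is None:
            self.active[alert.alert_id] = alert.version
            self.actions.append(f"create:{alert.alert_id}")
        elif alert.version > seen:
            self.active[alert.alert_id] = alert.version
            self.actions.append(f"update:{alert.alert_id}")
        # Same or older version: a duplicate, so nothing is re-sent to the customer.


state = BatteryAlertState()
for incoming in [Alert("ian", 1), Alert("ian", 1), Alert("ian", 2),
                 Alert("ian", 2, cancelled=True), Alert("ian", 2, cancelled=True)]:
    state.handle(incoming)
print(state.actions)
```

Five deliveries produce only three recorded actions (create, update, cancel), which is the property that keeps duplicate polling results from turning into duplicate notifications or control plans.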

Here we have this incoming alert, and we're going to focus on the hurricane warning, this red polygon right here. This is the WKT polygon that we saw earlier. Our initial approach was to use this WKT polygon directly, which is pretty complex, and do a direct intersection query in our asset database. However, these queries were taking a very long time. They were taking resources away from our concurrent queries, and the query was getting canceled due to the replication routine happening in the database. Instead of utilizing this complex polygon, we decided to use an approximate polygon. This is the 2D bounding box surrounding the complex polygon that we saw earlier. This was much quicker. PostGIS works really well with rectangular polygons and uses index only scans. It means it doesn't have to go to the buffer cache or to disk to retrieve the index, it's right there in memory. This returned the installation batteries much quicker.

However, we now faced an issue where we might have some false positives. We might have some sites that fall within our approximate query that don't fall within our original complex polygon, sites like this one. We implemented a process, client side, that iterates through the installation candidates that came back from our approximate query. We then filter out the ones that don't fall into the complex polygon. You might be wondering, why is it faster to do this two-step process? We're getting the same result. The same batteries are being returned in this two-step process that would have been returned had we done all this in our database with the original complex direct query. That's because this final process, iterating through each installation battery and determining if it intersects with the complex polygon, takes a lot of CPU. When that CPU is being used in our database node, it's taking CPU away from concurrent queries coming from the rest of our platform services. We moved that to a stateless process. We can then scale this stateless process horizontally or per pod. That's much easier than having to continuously upsize our database, which is why we now have this multi-step process.
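The two-step approach can be sketched in a few lines. Step 1 stands in for the cheap bounding-box query that the database answers, and step 2 is the CPU-heavy exact test that runs in a stateless client. The ray-casting point-in-polygon test used here is a common textbook technique chosen for illustration; the talk does not say which algorithm or library the client-side filter uses, and the polygon and site coordinates are invented.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # min_x, min_y, max_x, max_y


def bounding_box(polygon: List[Point]) -> Box:
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(p: Point, box: Box) -> bool:
    min_x, min_y, max_x, max_y = box
    return min_x <= p[0] <= max_x and min_y <= p[1] <= max_y


def in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray casting: toggle on every edge crossed by a ray going right from p."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def affected_sites(sites: Dict[str, Point], polygon: List[Point]) -> List[str]:
    box = bounding_box(polygon)
    # Step 1: approximate match (in the talk, this is the query the database runs).
    candidates = {sid: p for sid, p in sites.items() if in_box(p, box)}
    # Step 2: exact filter, run outside the database to drop false positives.
    return sorted(sid for sid, p in candidates.items() if in_polygon(p, polygon))


# Invented L-shaped alert area and sites; "b" is inside the box but outside the polygon.
alert_area = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
sites = {"a": (0.5, 0.5), "b": (3.0, 3.0), "c": (5.0, 5.0), "d": (0.5, 3.0)}
print(affected_sites(sites, alert_area))
```

Site "b" is the false positive the speaker describes: it passes the rectangle check and is only removed by the exact test, which is the part that can be scaled out independently of the database.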

Now that we've resolved the installations that are actually affected by this incoming alert, we can message them and let them know, so they can make independent control decisions based on each installation's system capabilities and customer preferences. Now that we've fanned out, each of these actors is processing in parallel using streaming, which means that one slow site is not going to affect another site's dispatch. We use this system to ensure that our customers can have backup support when they need it the most. This is an example of the mobile app during an active storm. We can see the battery is charging from the grid, and StormWatch is active, notifying our users why their battery is behaving this way.
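The isolation property, where one slow site does not hold up the others, can be illustrated with Python's asyncio as a stand-in for the actor and streaming machinery used in the talk. The site names, delays, and timeout are hypothetical, and the `asyncio.sleep` call only simulates a device round trip.

```python
import asyncio
from typing import Dict


async def dispatch(site_id: str, delay_s: float, timeout_s: float) -> str:
    """Pretend to send a control plan to site_id; delay_s simulates its response time."""
    try:
        await asyncio.wait_for(asyncio.sleep(delay_s), timeout=timeout_s)
        return "dispatched"
    except asyncio.TimeoutError:
        return "timed_out"


async def fan_out(sites: Dict[str, float], timeout_s: float = 0.2) -> Dict[str, str]:
    # All dispatches run concurrently; each has its own timeout budget.
    results = await asyncio.gather(*(dispatch(s, d, timeout_s) for s, d in sites.items()))
    return dict(zip(sites, results))


# "site-b" simulates an offline or very slow device.
outcome = asyncio.run(fan_out({"site-a": 0.0, "site-b": 5.0, "site-c": 0.01}))
print(outcome)
```

The whole fan-out finishes in roughly one timeout period rather than the sum of all delays, and the slow site is reported separately so it can be retried later.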

Takeaways (StormWatch Architecture)

Some key takeaways here: if you can abstract interfaces with external data providers to an internal service, this decouples the rest of your platform services from external providers and prevents data provider lock-in, or being limited to your external data provider's interface. Geosearches and other queries can be expensive. If you can switch to a multi-step approach and move CPU intensive tasks out to a stateless service, it's much more scalable. In-order message processing in a clustered or distributed environment can take extra consideration. We utilize the actor system to enable us to decide when we want to have parallel processing, as is the case with each of our individual battery actors processing per site capabilities, and when we want to handle things only once per cluster, and in a serialized fashion, as we do with our weather polling.

The Hierarchies Approach

Ortiz: StormWatch has proven to increase individual energy security for thousands of customers, but we're going to strive for more. We know there is local value in having a residential battery, but how can we take that local value and provide energy security for entire communities? Power outages caused by grid stress, like it happened during the California heat wave are a bit different than the ones that are happening due to damage to physical infrastructure, like it happened during Hurricane Ian. They are basically a deficit in the amount of available energy. If we can provide an interface for customers that have the battery systems that can store energy and utilities who need the support, then we can basically let customers opt in and support the grid. Tesla is in a unique position to provide this value. The platform that controls the batteries is already in place, and we can send control plans to them. Tesla is the actual critical link between the utilities that need the support and the customers who can provide this value. When we see a graph like this, we are all in trouble. There might be blackouts and those can be problematic. The first thing that we need to do to help the grid is identify who can help the grid. We do that by allowing eligible customers to join emergency grid support programs. We do that through the mobile app, which is an app that they already own. That thing is, in our opinion, novel. We know that utilities are normally not moving as fast as technology companies do. In these cases, it is critical that we move together fast to be able to support the grid. Also, for customers that are participating in these events, they are getting some compensation for it.

Our software builds upon the same foundational Tesla Energy platform that we were using for the StormWatch feature. One critical component for our virtual power plant programs is to identify and group different installations based on their utility instead of their geographic region. We have an available tool in our platform to do that, and it's called hierarchies. A hierarchy is a logical way to group installations based on a common characteristic or feature. You can think about it as a way to tag a group of installations. These groups are identified by a single identifier, and it is through this identifier that you can resolve in our platform which installations belong to that particular group. The hierarchies tool also allows you to aggregate groups into a larger group, creating a multi-layer tree-like schema. Hierarchies also bring fine-grained security into the picture because asset services can provide or deny access based on the group identifier. Asset services rely on Postgres to store all the asset data, but in this particular case, they use the ltree extension instead of the PostGIS extension. The ltree extension is an extension for Postgres that allows you to manage and efficiently query a tree-like structure. Let's see an example.

We can create a really simple table that we will call hierarchy, with only one column, called path, of the type ltree. Then we can insert some sample values. Here in the sample values, we can see that the ltree extension uses the dot (.) as the connector between nodes in the tree. How can we actually get the children of a given node in this tree? The ltree extension provides a new operator, <@, that allows you to query for those children. In this example, we can get the installations under the North California VPP eligible group. Also, we might want to know, what are all the groups that a particular installation belongs to? Then we need to query for parents. The way we do this is by joining the table with itself and then using the other provided operator, which is @>.
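The SQL shown on the slides is not included in this transcript. To show what the two operators compute, the sketch below emulates their documented containment semantics in Python: `a <@ b` is true when `a` is a descendant of `b` or equal to it, and `a @> b` is true when `a` is an ancestor of `b` or equal to it. The path labels are invented for illustration and are not the platform's real group identifiers.

```python
from typing import List


def is_descendant(path: str, ancestor: str) -> bool:
    """Mirrors `path <@ ancestor`: path equals ancestor or sits beneath it."""
    p, a = path.split("."), ancestor.split(".")
    return p[: len(a)] == a  # compare whole labels, not raw string prefixes


def is_ancestor(path: str, descendant: str) -> bool:
    """Mirrors `path @> descendant`."""
    return is_descendant(descendant, path)


# Hypothetical rows of the one-column hierarchy table.
hierarchy: List[str] = [
    "vpp",
    "vpp.north_california",
    "vpp.north_california.eligible",
    "vpp.north_california.eligible.site_1",
    "vpp.north_california.eligible.site_2",
    "vpp.north_california.enrolled",
    "vpp.north_california.enrolled.site_3",
]

# Children query: everything at or under the eligible group.
children = [p for p in hierarchy if is_descendant(p, "vpp.north_california.eligible")]
# Parents query: every group that contains site_3 (the site's own row included).
parents = [p for p in hierarchy if is_ancestor(p, "vpp.north_california.enrolled.site_3")]
print(children)
print(parents)
```

Comparing label lists rather than string prefixes matters: a node such as `vpp.north` must not be treated as an ancestor of `vpp.north_california`. Inside Postgres the same lookups can be served by an index on the ltree column instead of a scan like the list comprehensions above.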

In the context of our VPP programs, our enrollment flow also relies on the hierarchies table, where each group represents a state in the enrollment flow, from eligible, to pending, to enrolled. All these transitions are actually recorded by making a request to asset services. Also, if a customer decides to no longer participate in these programs, then the enrollment flow will make the appropriate request to asset services to remove their installations from those groups. You might be wondering why we decided to run this enrollment flow with hierarchies instead of a simpler tagging solution. Hierarchies was an already available solution in our platform and the California VPP program had to be developed in a very short period of time, so we were able to reuse a solution to this particular case. Also, the enrollment flow now can be reused for future programs. Having these well-defined segregated groups also helps us reach out to customers depending on their needs. For example, this is a notification a customer will receive if they are in the eligible group and they have not yet enrolled, or a notification that a customer that has enrolled will receive if there's a scheduled event in the near future.
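The enrollment flow described above, where each group stands for one state, can be sketched as a tiny state machine. This is an in-memory stand-in written for illustration: in the talk the transitions are requests to asset services, and the class and method names here are hypothetical.

```python
from typing import Dict, Optional, Set

ORDER = ["eligible", "pending", "enrolled"]


class EnrollmentGroups:
    """Each group holds the sites currently in that enrollment state."""

    def __init__(self) -> None:
        self.groups: Dict[str, Set[str]] = {name: set() for name in ORDER}

    def state_of(self, site_id: str) -> Optional[str]:
        return next((name for name in ORDER if site_id in self.groups[name]), None)

    def mark_eligible(self, site_id: str) -> None:
        if self.state_of(site_id) is None:
            self.groups["eligible"].add(site_id)

    def advance(self, site_id: str) -> str:
        """Move a site one step along eligible -> pending -> enrolled."""
        current = self.state_of(site_id)
        if current is None:
            raise KeyError(f"{site_id} is not in any enrollment group")
        if current == ORDER[-1]:
            return current  # already enrolled, nothing to do
        nxt = ORDER[ORDER.index(current) + 1]
        self.groups[current].discard(site_id)
        self.groups[nxt].add(site_id)
        return nxt

    def unenroll(self, site_id: str) -> None:
        """Customer leaves the program: remove the site from every group."""
        for members in self.groups.values():
            members.discard(site_id)


flow = EnrollmentGroups()
flow.mark_eligible("site_1")
print(flow.advance("site_1"), flow.advance("site_1"))
```

Keeping a site in exactly one group at a time is what makes the targeted notifications straightforward: the audience for a reminder is simply the membership of the eligible group.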

Thanks to this multi-layer hierarchy approach, we can identify all the installations under a whole program by creating a group higher in the hierarchy that contains all the enrollment groups under it. Now that we have identified who can help the grid, we can use the same digital twin architecture that we were using for StormWatch to track event changes and customer participation changes. In this case, instead of getting events from weather providers, we get events from our relationship with the grid operators. Once we get an event, we resolve which installations can help support the grid and dispatch control plans to them. In this case, the control plan has two phases. The first phase is to charge the battery as much as you can during off-peak hours. The second phase is to discharge during the event until the end of the event, or until the battery hits its backup percent limit. When we combine our hierarchies tool with our telemetry pipelines, we can get aggregated energy data for any given group. In the context of our VPPs, this can show the total contribution that a particular group is discharging into the grid. In this particular view, a customer who has enrolled and is actually participating in one of these events is able to see their own contribution, which is like 3.1 kilowatts, but also the total contribution, highlighting that they are part of a larger community impact.
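A back-of-the-envelope sketch of the discharge phase and the group aggregation follows. The stop conditions (end of event, or backup reserve reached) come from the description above. The battery capacity, power level, and per-site readings are invented numbers, apart from the 3.1 kW figure mentioned in the talk, and the function names are hypothetical.

```python
from typing import Dict


def event_discharge_kwh(capacity_kwh: float, state_of_charge: float, backup_reserve: float,
                        power_kw: float, event_hours: float) -> float:
    """Energy sent to the grid: stops at the event's end or at the backup reserve."""
    above_reserve_kwh = max(0.0, state_of_charge - backup_reserve) * capacity_kwh
    return min(above_reserve_kwh, power_kw * event_hours)


def group_power_kw(site_power_kw: Dict[str, float]) -> float:
    """Aggregate instantaneous discharge for every site resolved under one group."""
    return sum(site_power_kw.values())


# Hypothetical 10 kWh battery, fully charged off-peak, 20% backup reserve, 5 kW discharge.
short_event = event_discharge_kwh(10.0, 1.0, 0.2, power_kw=5.0, event_hours=1.0)  # event ends first
long_event = event_discharge_kwh(10.0, 1.0, 0.2, power_kw=5.0, event_hours=3.0)   # reserve is hit first
total_kw = group_power_kw({"site_1": 3.1, "site_2": 4.0, "site_3": 2.5})
print(short_event, long_event, total_kw)
```

The first phase of the plan is what makes the `state_of_charge` argument high at the start of the event: a battery that charged off-peak has more energy above its reserve to contribute.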

During the California heat wave event, there were multiple dispatching events. We were able to peak at 33 megawatts, which is the same power as a small peaker plant. This was thanks to over 4500 customers who enrolled in these Tesla emergency grid support programs, which allow Tesla and utilities to preemptively charge their batteries and discharge them during times of high demand. These customers were compensated $2 per kilowatt hour discharged to the grid. Tesla VPP programs created the largest distributed battery in the world to help keep California's energy clean and reliable, while providing an engaging user experience where customers have agency and can feel that they are part of something larger than their own installation.

Takeaways (The Hierarchies Approach)

Some takeaways here. Hierarchies work as a multi-level tagging and security tool that allows you to quickly identify entities through the ltree extension. Reusability. We're big fans of reusing components, and making them generic so we can move and deliver faster. Short-notice trigger events require you to do as much preprocessing as possible to ensure a successful outcome, in our case, understanding who can participate in these programs. Dynamic programs like virtual power plants need reactive processes and applications that provide value at any given moment and handle adverse scenarios such as devices going offline, or customers enrolling or unenrolling at any given moment.

Recap - Tesla VPPs and StormWatch Features

DellaMaria: The Tesla Virtual Power Plants and StormWatch features are two critical user facing applications that from a power flow perspective are complete opposites, but from a software perspective are actually quite similar. Let's recap how both these features work. With both features, we need to detect discrete events. We need to resolve which homes are affected. We need to run arbitration and decision making to determine if these batteries are capable and should participate in the event. We need to notify our customers. We need to actually communicate to the batteries their respective control plans. We can already see the duality in these two features, basically two sides of the same coin, and powering both these feature experiences is the same application. This enables our customers to enroll in our virtual power plant programs and, at the same time, enable the StormWatch feature to ensure they have independent energy security. Reusing this same architecture saves us from reimplementing common logic and common integrations. It also provides a clean space for when we need to do event arbitration. When we do have StormWatch events conflicting with our virtual power plant events, we can utilize the fact that these customer battery actors already need to store some state to support each independent feature. They need to be able to handle future alert updates, cancellations, and duplications. All that's required to support conflicting events is to add minimal logic into these existing actor entities to respect customer priorities. All we have to do is add a simple experience to the mobile app to give our customers control and agency over exactly how they want their battery to be used in such scenarios.
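The "minimal logic" for conflicting events could look something like the following sketch, in which the customer's preference is an ordered list of event kinds. The `Event` fields, the kind names, and the rule that an omitted kind means opting out are all assumptions made for illustration; the talk does not describe the actual arbitration rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical kinds: "storm" or "vpp"
    action: str  # what the resulting control plan would ask the battery to do


def arbitrate(events: List[Event], priority: List[str]) -> Optional[Event]:
    """Pick the active event whose kind ranks highest in the customer's priority list.

    Kinds missing from the list are treated as opted out (an assumption of this sketch).
    """
    ranked = [e for e in events if e.kind in priority]
    if not ranked:
        return None
    return min(ranked, key=lambda e: priority.index(e.kind))


active = [Event("vpp", "discharge_to_reserve"), Event("storm", "charge_to_full")]
print(arbitrate(active, priority=["storm", "vpp"]))  # storm preparation wins
print(arbitrate(active, priority=["vpp"]))           # customer opted out of storm events
```

The two events ask for opposite power flows, which is why a single place that sees both, with the customer's stated preference, is needed before any control plan is sent.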

It's a very dynamic and a very bursty system. When we have events going on, it's very active. There's a lot of messaging happening between the mobile app and external events. However, it's relatively calm at all other times. This is a system where you can't shed load. When it's very active and we have a lot of load is exactly when it's most critical. This is when our customers need us the most. We handle this by utilizing reactive streaming concepts throughout the system, and in how we interact with our downstream services. This includes backoff retries, backpressure, and throttling. We also utilize parallelization between the actors. Once we fan out to each individual battery actor entity, that processing and streaming is happening in parallel. This way, one offline device is not going to slow down or stop another customer's experience. Additionally, we made sure that this application was horizontally scalable. These actors are persistent and cluster sharded. That means they store their state in a persistent datastore, supported by the Akka framework. Then when we spin up new pods to handle and distribute load, these actor entities can move over to the new pods and use those new resources. This allows us to scale to meet our growing user base.
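Of the reactive techniques listed above, backoff retry is the easiest to show in isolation. The sketch below is a generic capped exponential backoff with jitter in plain Python, not the Akka Streams operator the team uses, and the default attempt count and delays are arbitrary illustrative values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T], attempts: int = 5, base_s: float = 0.1,
                       max_s: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry call() with capped exponential backoff plus jitter; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_s, base_s * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")


# Simulated downstream service that fails twice before answering.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("downstream unavailable")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, calls["count"])
```

Growing delays give a struggling downstream service room to recover, which matters in a system that cannot shed load: the work is delayed rather than dropped.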

Another key thing to notice is that these battery installation actors also support the mobile app experience. These digital twins that really represent the physical batteries of an owner, they can then see the state of them right there in the mobile app. As they make changes, they can opt in and out of individual events, programs as a whole, and update that priority, and that will get sent down to their battery and then reflected back in the mobile app, showing them the state of their system in real time. Then this, again, enables them to have unique control over how they want their battery to be used at any time, enabling them for their own use cases, like this customer was able to opt out of an individual storm event because he already had full capacity based off his excess solar at his own home.


Some takeaways here: design upfront for horizontal scalability. We took the time initially to implement persistent cluster sharded actors. This gives us the freedom to know that we can just scale out our application horizontally when we need to. Reactive streaming concepts are critical when you have dynamic and bursty systems that can't shed load, so backoff retries, throttling, and backpressure are very important. Give control to your customers through great mobile app experiences or other integrated user experiences. This allows you to have simpler logic in your own backend systems. Don't be afraid to use toolkits to help in building these distributed systems. We are supported by Akka, but you can also implement similar things in Erlang/OTP, and other functional programming toolkits.

Why Cloud?

Ortiz: With the increasing capabilities of the edge devices where those keep having more memory and more CPU these days, some of you might wonder why we're running this decision making and arbitration in the cloud instead of in the devices themselves. There are actually pros and cons for both approaches. Since what we're interested in is in actually moving faster, we can iterate faster on the cloud than at the edge. That helps us create new features quickly and mature these existing programs quicker. In addition, we have the luxury to have vertical integration when we manage both the cloud software and the edge software. We can selectively send down pieces of our logic to our edge, where we believe those could make a great impact and help expand our programs and features.

Looking Forward

What is next? Traditionally, our customers have thought about residential batteries as a way to only increase individual energy security. These new virtual power plant programs, a paradigm still unknown to many, expand these possibilities. As these things are becoming more popular, the interest is actually growing. Is this the way to think about batteries in the future? We don't believe so. We think it is the way to think about batteries right now. We believe that right now we are in an educational period. Utilities and customers are learning about these programs. As these programs become more widespread, more people want to become part of them. Thanks to the California heat wave event, we were able to showcase that this is a real use case for batteries and there's a high potential to grow and scale. Also, from the software perspective, a battery, either small or large, is the same to manage. This very fact opens the possibility to expand current programs and create new ones.


It is thanks to the combination of smart connected devices such as batteries, coupled with reliable cloud software such as the Tesla Energy platform, that we continue pushing the limits to increase energy security, both individually and collectively, to increase the quality of life, and to accelerate the world's transition to sustainable energy.

DellaMaria: If you're interested in learning more about the Tesla Virtual Power Plants, or the Tesla Energy platform, we highly recommend the talk that our colleagues Colin Breck and Percy Link gave at QCon London in 2020. Then, if you're interested in helping solve hard real-world problems, helping increase energy security and supporting the grid, while accelerating the transition to a sustainable world, please come join us at Tesla.




Recorded at:

Jul 15, 2023