
Tesla Virtual Power Plant



Colin Breck and Percy Link explore the evolution of Tesla's Virtual Power Plant (VPP) architecture. A VPP is a network of distributed energy resources (often solar, wind, and batteries) that are aggregated to provide smarter and more flexible power generation, distribution, and availability. Tesla's VPP consists of vertically integrated hardware and software, including both cloud and edge computing.


Colin Breck is a Sr. Staff Software Engineer at Tesla. He works on distributed systems for the monitoring, aggregation, optimization, and control of distributed-energy assets, including solar generation, battery storage, and the Supercharging network. Percy Link is a staff software engineer on the Energy Optimization team at Tesla, working on the Autobidder platform.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Link: The electric grid is the largest and most complex machine ever built. It's an amazing feat of engineering, providing reliable, safe, and on-demand power. This grid is built on 20th-century technology with large, centralized generation, mostly fossil fuel-based, and only a few points of control. We now face an urgent challenge to transition off of fossil fuels in order to prevent the worst effects of climate change. Fortunately, we also now have new tools. Clean power generation like wind, solar, and hydro is cheap and getting cheaper. But this hardware on its own is not enough to replace fossil fuels while maintaining our current standard of on-demand and reliable power.

Breck: Software is really the key to enabling these diverse components to act in concert. One of the things we can do is bring together thousands of small batteries in people's homes to create virtual power plants, providing value to both the electrical grid as well as to the home or business owner. This marries some of the most interesting and challenging problems in distributed computing, with some of the most important and challenging problems in distributed renewable energy. This is why we work at Tesla. We get to work on these exciting software challenges while also accelerating the world's transition to renewable energy. We're going to take you through the evolution of the Tesla virtual power plant and share architectures, patterns, and practices for distributed computing and IoT that have helped us tackle these complex and exciting challenges.

Link: I'm Percy [Link]. I'm a software engineer and technical lead on the team that builds Tesla's energy optimization and market participation platform.

Breck: I'm Colin [Breck]. I'm a software engineer, and I lead the teams that build and operate the cloud IoT platforms for Tesla energy products. Just a disclaimer before we start. We do not speak on behalf of Tesla, we're just representing our personal experiences.

How Does the Grid Work?

Link: Before we dig into the software, let's cover some background on how the grid works and on the role that batteries play in it, so that you're set up to appreciate the software problems. The tricky thing about the power grid is that supply and demand have to match in real-time or else frequency and voltage can deviate, and this can damage devices and lead to blackouts. The grid itself has no ability to store power, so the incoming power supply and outgoing power consumption need to be controlled in a way that maintains the balance.

With old-style centralized fossil fuel generation, supply could be turned up and down according to demand, and there was a relatively small number of plants to control. This made it relatively straightforward to maintain the balance. As more renewable generation comes onto the grid, a few things happen. First, reduced control. Generation can't be as easily turned up to follow demand, and we don't want to turn generation down, or else we're losing some of our clean energy. Second, uncertainty and rapid change. Generation can't be forecast precisely, and it can change quickly. Third, distribution. There are many small generators behaving independently.

In a grid with large amounts of wind and solar generation, the supply might look something like this, with variability in supply and with times of high supply not aligned with times of high demand. This can result in power surpluses and deficits that are larger than before and that can change fairly rapidly. Batteries can charge during the surpluses and discharge during the deficits, and they can respond very quickly to offset any rapid swings in the imbalance. This rapid response is actually an innovation, an opportunity to be better than the old grid. It's not just a compromise.

To fulfill this role, we could just install giant batteries, batteries the size of a typical coal or natural gas power plant, and they can and do play an important part in the equation. We can also take advantage of smaller batteries already installed in individual homes that are already providing local value, like backup power or helping the owner consume more of their own solar generation. We can aggregate homes and businesses with these smaller batteries and solar into virtual power plants.

In this presentation, we'll walk you through the evolution of the Tesla energy platform for virtual power plants. It's broken into four sections, with each stage laying the foundation for the next. We'll start with the development of the Tesla energy platform. Then we'll describe how we learned to participate in energy markets, and how we learned to build software to do this algorithmically using a single battery, the largest battery in the world. Then we'll talk about our first virtual power plant, where we learned to aggregate and directly control thousands of batteries in near real-time in people's homes. Finally, we'll talk about how we combined all of these platforms and experiences to aggregate, optimize, and control thousands of batteries for energy market participation.

Tesla Energy Platform

Breck: Let's begin with the architecture of the Tesla energy platform. This platform was built for both residential and industrial customers. For residential customers, the platform supports products like the Powerwall home battery, which can provide backup power for a house for hours or days in the event of a power outage; Solar Roof, which produces power from beautiful roofing tiles; and retrofit solar. The solar products can be paired with Powerwall to not only provide backup power, but also maximize solar energy production.

We use software to deliver an integrated product experience across solar generation, energy storage, backup power, transportation, and vehicle charging, as well as create unique products like Storm Watch, where we will charge your Powerwall to full when alerted to an approaching storm so that you have full backup power if the power goes out. Part of the customer experience is viewing the real-time performance of the system in the mobile app, and customers can control some behaviors, such as prioritizing charging during low-cost times.

For industrial customers, the software platform supports products like Powerpack and Megapack for large-scale energy storage, as well as industrial-scale solar. Software platforms like PowerHub allow customers to monitor the performance of their systems in real-time or inspect historical performance over days, weeks, or even years. Now, these products for solar generation, energy storage, transportation, and charging all have an edge computing platform. Zooming in on that edge computing platform for energy, it's used to interface with a diverse set of sensors and controllers, things like inverters, bus controllers, and power stages.

It runs a full Linux operating system and provides local data storage, computation, and control, while also maintaining bi-directional streaming communication with the cloud over WebSocket so that it can regularly send measurements to the cloud for some applications as frequently as once a second. It can also be commanded on-demand from the cloud. Now, we'll mention a few things throughout the presentation about this edge computing platform, but our main focus is going to be on that cloud IoT platform.

The foundation of this platform is this linearly scalable WebSocket front end that handles connectivity as well as security. It has a Kafka cluster behind it for ingesting large volumes of telemetry from millions of IoT devices. This provides messaging durability, decouples publishers of data from consumers of data, and it allows for sharing this telemetry across many downstream services. The platform also has a service for publish-subscribe messaging, enabling bi-directional command and control of IoT devices. These three services together are offered as a shared infrastructure throughout Tesla on which we build higher-order services. On the other side of the equation are these customer-facing applications supporting the products that I just highlighted.

The APIs for energy products are organized broadly into three domains. The first are APIs for querying telemetry, alerts, and events from devices, or streaming these as they happen. Second are APIs for describing energy assets and the relationships among these assets. Lastly, there are APIs for commanding and controlling energy devices like batteries. Now, the backing services for these APIs are composed of approximately 150 polyglot microservices, far too many to detail in this presentation. I'll just provide a high-level understanding of the microservices in each domain.

We're going to dive a bit deeper into a few of them later when we look at the virtual power plant. The theme you'll see throughout is the challenge of handling real-time data at IoT scale. Imagine a battery installed in everybody's home. To support efficient queries and roll-ups of telemetry (queries like, what's the power output over the past day or week?), we use InfluxDB, which is an open-source, purpose-built time-series database. It depends on the data stream and the application, but generally, our goal is to make historical data available to the customer for the lifetime of the product. We maintain a large number of low latency streaming services for data ingestion and transformation.

For some of these Kafka topics, the very first thing we do is create a canonical topic where data are already filtered and refined into very strict data types. This is more efficient because it removes this burden from every downstream service, and it also provides consistency across the downstream consumers. A very unique challenge in this domain is the streaming real-time aggregation of telemetry from thousands of batteries. This is a service that we'll look at in much more detail because it forms one of the foundations of the virtual power plant.
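The idea of a canonical topic can be sketched as a small filtering step. This is a hypothetical Python illustration (the actual services are Scala/Akka Streams consumers, and the field names here are invented), showing how raw messages are refined into strict types so downstream consumers never have to revalidate:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator

@dataclass(frozen=True)
class SiteTelemetry:
    site_id: str
    timestamp: datetime
    power_watts: float

def to_canonical(raw_messages: Iterable[dict]) -> Iterator[SiteTelemetry]:
    """Refine raw telemetry into strict canonical records, dropping
    malformed messages so no downstream consumer has to revalidate."""
    for msg in raw_messages:
        try:
            yield SiteTelemetry(
                site_id=str(msg["site_id"]),
                timestamp=datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc),
                power_watts=float(msg["power_w"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # filtered out of the canonical topic

raw = [
    {"site_id": "site-1", "ts": 1700000000, "power_w": 5000.0},
    {"site_id": "site-2", "power_w": "n/a"},   # malformed: missing timestamp
]
canonical = list(to_canonical(raw))
```

Performing this refinement once, at the first hop off the raw topic, is what removes the parsing burden from every downstream service.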

Like at any large company, product and customer information comes from many different business systems. It's really unworkable to have every microservice connect to every business system, many of which are not designed to be internet-facing or IoT scale. The asset management services serve four purposes. One is to abstract and unify these many different business systems into one consistent API. Two is to provide a consistent source of truth, especially when there are conflicting data. Three, they provide a kind of type system where applications can rely on the same attributes of the same type of device, like a battery. Fourth, they describe unique relationships among these energy assets, like which devices can talk to each other and who can control them.

These services rely heavily on a Postgres database to describe these relationships. We use Kafka to integrate changes as they happen from many of these different business systems, and we also stream changes directly from IoT devices. At scale, actually, this is a lot more reliable. Devices are often the most reliable source of truth, self-reporting their configuration, state, and relationships. A digital twin is the representation of a physical IoT device (a battery, an inverter, a charger) modeled virtually in software, and we do a lot of digital twin modeling to represent the current state and relationships of various assets.
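As a rough illustration of that precedence rule, here is a minimal Python sketch of a digital twin that keeps business-system attributes and device self-reports separate and lets the device win on conflict. The class and field names are hypothetical, not Tesla's actual model:

```python
class AssetTwin:
    def __init__(self, asset_id):
        self.asset_id = asset_id
        self.business = {}     # attributes from business systems
        self.reported = {}     # attributes self-reported by the device
        self.children = []     # related assets, e.g. inverters under a site

    def update_from_business_system(self, attrs):
        self.business.update(attrs)

    def update_from_device(self, attrs):
        self.reported.update(attrs)

    def view(self):
        # Merged view: device self-reports win on conflict, since at scale
        # the device is often the most reliable source of truth.
        return {**self.business, **self.reported}

twin = AssetTwin("battery-1")
twin.update_from_business_system({"model": "Powerwall", "firmware": "1.0"})
twin.update_from_device({"firmware": "1.2"})   # device reports newer firmware
```

Keeping the two sources separate, rather than overwriting a single record, preserves the ability to resolve conflicts differently per consumer.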

Finally, there are services for commanding and controlling IoT devices like telling a battery to discharge at a given power setpoint for a specific duration. Similar to both the telemetry and asset domains, we need a streaming stateful and real-time representation of IoT devices at scale, including modeling this inherent uncertainty that comes with controlling IoT devices over the internet.

Akka has been an essential tool for us for building these microservices. Akka is a toolkit for distributed computing. It also supports actor-model programming, which is great for modeling the state of individual entities like a battery, while also providing a model for concurrency and distribution based on asynchronous and immutable message passing. It's a really great model for IoT, and I'll provide some specific examples later in the presentation. Another part of the Akka toolkit that we use extensively is the Reactive Streams component called Akka Streams. Akka Streams provides sophisticated primitives for flow control, concurrency, and data management, all with backpressure under the hood, ensuring that the services have bounded resource constraints. Generally, all the developer writes are functions, and then Akka Streams handles the system dynamics, allowing processes to bend and stretch as the load of the system changes and the messaging volume changes.
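Backpressure in this sense just means a fast producer is slowed to the consumer's pace so memory stays bounded. Akka Streams does this transparently; the following Python sketch shows only the flavor of the idea using a bounded queue, and is not the Akka API:

```python
import queue
import threading

def run(n_items=100, bound=8):
    buf = queue.Queue(maxsize=bound)   # bounded buffer: at most 8 in flight
    results = []

    def producer():
        for i in range(n_items):
            buf.put(i)                 # blocks when the buffer is full
        buf.put(None)                  # end-of-stream marker

    def consumer():
        while (item := buf.get()) is not None:
            results.append(item * 2)   # stand-in for a processing stage

    t = threading.Thread(target=producer)
    t.start()
    consumer()                         # consume on the main thread
    t.join()
    return results

out = run()
```

However fast the producer runs, at most `bound` items are ever buffered, which is the bounded-resource guarantee the talk refers to.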

The Alpakka project has a large number of these Reactive Streams interfaces to services like Kafka or AWS S3, and Alpakka is what we use for interfacing with Kafka extensively. We don't actually use Kafka Streams because we find the interface there is too simplistic for our use case, and it's also ecosystem-specific. Akka Streams provides this much more general-purpose streaming tool.

Like any large platform, there is a mix of languages, but our primary programming language is Scala. The reason we came to Scala was through Akka because it's really the first-class way to use Akka. Then we really fell in love with Scala's rich type system, and we became big fans of functional programming for building large, complex distributed systems. We like things like the compile-time safety, immutability, pure functions, composition, and doing things like modeling errors as data rather than throwing exceptions.

For a small team, having a primary programming language where you invest in a deep understanding and first-class tooling is a huge boost to productivity, job satisfaction, and the overall quality of a complex system. The majority of our microservices run in Kubernetes, and the pairing of Akka and Kubernetes is really fantastic. Kubernetes can handle coarse-grained failures and scaling, so that would be things like scaling pods up or down, running liveness probes, or restarting a failed pod with an exponential back-off.

Then we use Akka for handling fine-grained failures, like circuit breaking or retrying an individual request, and modeling the state of individual entities, like the fact that a battery is charging or discharging. Then we use Akka Streams for handling the system dynamics in these message-based, real-time streaming systems. The initial platform was built with traditional HTTP APIs and JSON, which allowed rapid development. Over the past year, we've invested much more in gRPC. It's been a big win.

It's now our preference for new services, or if we extend older services, and it brought three distinct advantages. Strict contracts make these systems much more reliable. Code generation of clients means we're no longer writing clients, which is great. Third, and somewhat unexpectedly, we saw much-improved cross-team collaboration around these contracts. We're not just seeing this with gRPC, because we also prefer protobuf for our streaming messages, including the ones going through Kafka. We maintain a single repository where we share these contracts and then collaborate across projects.

I've mentioned this theme of strict typing a few times, rich types in Scala, strict schema with protobuf, and then these strict asset models for systems integration. Constraints ultimately provide freedom and they allow decoupling of microservices and decoupling of teams. Constraints are really a foundation for reliability in large scale distributed systems.

Takeaways from building the Tesla energy platform: We were lucky to embrace the principles of reactive systems from day one, and this produced incredibly robust, reliable, and effective systems. Reactive Streams is a really important component for handling the system dynamics and providing resource constraints while also providing this rich, general-purpose API for streaming.

What's needed to build these complex services, especially in IoT, is a toolkit for distributed computing. For us, that's been Akka. For others, that might be Erlang/OTP. I think now we're also seeing the evolution of stateful serverless platforms to support the same building blocks. I kind of imagine that's how we're all going to be programming these systems in the future. That's things like managing state, modeling individual entities at scale, workflow management, streaming interfaces, and then allowing the runtime to handle concurrency, distribution, and failure.

Strict contracts make systems more reliable and allow services and teams to work in a more decoupled way while also improving collaboration. Don't develop every microservice differently just because you can. Compound your investments in your knowledge and in your tooling by creating a deep understanding and a paved path in your primary toolset.

Single-Battery Market Participation

Link: On top of the Tesla energy platform that Colin described, we built our first power plant type application. In this phase, we were learning how to productize real-time forecasting and optimization of batteries. In this case, we started with a single, albeit very large, battery: the Hornsdale battery. The Hornsdale battery was built on a tight timeline because of this famous tweet. It's the largest battery in the world at 100 megawatts, 129 megawatt-hours, which is about the size of a gas turbine.

Hornsdale helps keep the grid stable, even as more renewables are coming online. Not only is it keeping the grid stable, it's actually reduced the cost to customers of doing so. It helps the grid by providing multiple kinds of services. During extreme events, like when a generator trips offline, Hornsdale responds nearly instantaneously to balance big frequency excursions that could otherwise cause a blackout. Even during normal times, whereas a conventional generator's response lags the grid operator's signal by the order of minutes, the battery can follow the grid operator's frequency regulation commands nearly instantaneously. This helps maintain a safe grid frequency.

This big battery provides these services to the grid by way of the energy market. Why do we have to participate in an energy market? Recall the requirement that supply and demand have to be balanced in real time. Markets are an economically efficient way to make them balance. Participants bid what they're willing to produce or consume at different price levels and at different timescales. An operator activates the participants who can provide the necessary services at the lowest price. If we want the battery to continually provide services to the grid and help stabilize it as more renewables come online, we need to be participating in energy markets.
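The activation step the talk describes is essentially merit-order dispatch: sort bids by price and activate the cheapest capacity until demand is met. A toy Python sketch (with invented numbers, not real market data):

```python
def dispatch(bids, demand_mw):
    """bids: (price_per_mwh, capacity_mw) pairs. Returns the activated
    (price, mw) pairs in merit order until demand is covered."""
    activated, remaining = [], demand_mw
    for price, capacity in sorted(bids):       # cheapest first
        if remaining <= 0:
            break
        take = min(capacity, remaining)
        activated.append((price, take))
        remaining -= take
    return activated

bids = [(70, 50), (30, 40), (50, 60)]          # toy bid stack
plan = dispatch(bids, demand_mw=80)
```

A fast-responding battery that can bid low for these services is exactly how it ends up being activated ahead of slower, more expensive generators.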

In order to participate in energy markets, we need software. To do this, we built Autobidder to operate Hornsdale, and now we offer it as a product. This is the UI for Autobidder. It's a pro tool that's intended for control room operators who look at it day in and day out. There's a lot of information on the screen, I know, but it's running workflows that fetch data, forecast prices and renewable generation, decide on an optimal bid, and then submit it. These workflows run every five minutes; that's the cadence of the Australian market.

At a high level, the optimization problem is trade-offs across different market products, which are different kinds of services that the battery can provide to the grid, and trade-offs in time, since the battery has a finite amount of energy. Autobidder is built in the applications layer of the Tesla energy platform that Colin [Breck] described. It consists of several microservices, and it interacts with both the platform and with third-party APIs. Autobidder is fundamentally a workflow orchestration platform. You might ask why we built our own rather than using an open-source tool. The key thing is that this is operational technology. These aren't batch or offline jobs, and it's critical for financial and physical reasons that these workflows run.

We also leverage our primary toolset, which allowed us to avoid introducing a new language and new infrastructure into our stack. At the center of the system is the orchestrator microservice, which runs the autobidding workflows. A principle we hold to is to keep this core as simple as possible and contain the complexity in the peripheral services. The market data service abstracts the ETL of complex input data. This data has diverse timing in when it arrives relative to the market cycle. This service handles that timing, and it handles fallbacks in cases of late-arriving or missing data.

There's a Forecast service and optimization service that execute algorithmic code, and a bid service to interact with the market submission interface. The orchestrator, market data service, and bid service are written in Scala. Again, this common toolkit gives us great concurrency semantics, functional programming type safety, and compounding best practices across our teams.

However, the forecast and optimization services are in Python. This is because it's very important to us to enable rapid algorithm improvement and development, and Python gives us a couple of things there. There are key numerical and solver libraries available in Python, and also the algorithm engineers on our team are more fluent in Python. Having these services in Python empowers them to own the core logic there and iterate on it.

The communication between the market data and bid services and the orchestrator happens over gRPC, with the benefits Colin described: strict contracts, code generation, and collaboration. The communication between the orchestrator and the forecasting and optimization services uses Amazon SQS message queues. These queues give us durable delivery, retries in cases of consumer failures, and easily support long-running tasks without a long-lived network connection between services. We use an immutable input-output messaging model, and the messages have strict schemas.

This allows us to persist the immutable inputs and outputs and have them available for backtesting, which is an important part of our overall team's mission. Also, SQS allows us to build worker pools. Like I said, forecast and optimization are in Python, which has somewhat cumbersome concurrency semantics. The message queue allows us to implement concurrency across workers, instead of within a worker. It's notable that these services are effectively functions. They take inputs and produce outputs without other effects. This keeps them more testable, makes these important algorithm changes and improvements safer, relieves algorithm engineers of the burden of writing I/O code, and lets us use Scala concurrency for I/O.
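That input-output model can be illustrated with a tiny Python sketch: the algorithmic service is a pure function from an immutable request message to an immutable result message, with all I/O kept outside. The message fields and the persistence-style forecast are invented stand-ins for the real numerical code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ForecastRequest:            # immutable input message
    site_id: str
    recent_load_kw: tuple

@dataclass(frozen=True)
class ForecastResult:             # immutable output message
    site_id: str
    next_hour_kw: float

def forecast(req: ForecastRequest) -> ForecastResult:
    # Stand-in algorithm: persistence forecast (next hour = last value).
    # The real services run numerical and solver code; no I/O happens here.
    return ForecastResult(req.site_id, req.recent_load_kw[-1])

inbox = [ForecastRequest("site-1", (3.0, 4.5, 5.0))]   # from the queue
outbox = [forecast(r) for r in inbox]                  # back onto the queue
```

Because the function touches nothing but its arguments, the same persisted inputs can be replayed later for backtesting a new algorithm version.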

Stepping back and looking at these workflows as a whole, these workflows are stateful. The state is a set of immutable facts that are generated by business logic stages. These stages happen sequentially in time. The workflow state includes things like knowing what the current stage is and accumulating the results of a task within the current stage and across stages. Some stages like the forecast stage have multiple tasks that need to be accumulated before deciding to proceed, and some stages might need outputs of multiple previous stages, not just the immediate predecessor.

In case of a failure, like the orchestrator pod restarting, we don't want to forget that a workflow was in progress, and we prefer not to completely restart it. We can instead take snapshots of the state at checkpoints. If the workflow fails, it can be resumed from the last checkpoint. We keep this state in an actor representing the workflow state machine. Akka persistence gives us transparent resumption of the state through checkpointing and an event journal.
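The checkpoint-and-resume behavior can be sketched as follows. This is a plain-Python illustration in the spirit of Akka Persistence snapshots, not its API; the stage names are illustrative:

```python
STAGES = ["fetch", "forecast", "optimize", "bid"]

def run_workflow(snapshot=None, fail_after=None):
    """Run stages in order, checkpointing after each one. If `fail_after`
    names a stage, stop right after it (simulating a pod restart); the
    returned snapshot lets a later call resume instead of restarting."""
    state = snapshot or {"completed": [], "results": {}}
    for stage in STAGES:
        if stage in state["completed"]:
            continue                       # finished before the failure
        state["results"][stage] = f"{stage}-output"
        state["completed"].append(stage)   # checkpoint the new state
        if stage == fail_after:
            return state, False            # "crash" after this checkpoint
    return state, True

snap, done = run_workflow(fail_after="forecast")   # fails mid-workflow
snap, done = run_workflow(snapshot=snap)           # resumes, doesn't redo work
```

The point is that work completed before the failure is never repeated; the resumed run picks up at the first unfinished stage.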

An important lesson we've learned is to keep the business logic of stage execution in pure functions, separate from the actor as much as possible. This makes testing and composition of that business logic so much easier, and the new Akka Typed API naturally helps with that decomposition.

On our team, it's very important to enable rapid development, improvement, and iteration on algorithms, and so we have Python in specific places in our system. We also really need to minimize the risk that iterating on the algorithms breaks workflows. A couple of things work really well for us to minimize that risk: an input-output model for the algorithmic services keeps that code simpler and more easily testable, and strict contracts give us the freedom to change the algorithms' internal logic independently of the rest of the system.

It's been important for us to abstract the messy details of external data sources and services from the core system. This is a fundamental tenet of the whole platform, actually. These workflows are inevitably stateful, but entangling state with the business logic stages can lead to spaghetti code; instead, keep the business logic stages functional, testable, and composable.

Peak-Shaving Virtual Power Plant

Breck: In the next part, let me describe our first virtual power plant application. Percy [Link] just described how we leverage the platform to participate in the energy markets algorithmically with one large battery. Now, we'll focus on how we extend that and use what we learned to measure, model, and control a fleet of thousands of Powerwalls that are installed in people's homes to do peak shaving for an electrical utility.

Before I detail the software architecture, I'll describe the problem that we're trying to solve. This is a graph of aggregate grid load in megawatts. Grid load varies with weather and with time of year. This is a typical load profile for a warm summer day. The left-hand side is midnight, the minimum load is around 4 a.m. when most people are sleeping, and then peak load is around 6 p.m. when a lot of people are running air conditioning or cooking dinner.

Peak loads are very expensive. The grid only needs to meet the peak load a few hours in a year. One option for satisfying the peak load is to build more capacity, which incurs significant capital costs, and then this capacity is largely underused outside of those peaks. The other option is to import power from another jurisdiction that has excess, and this is often at a significant premium.

Power can be cheaper if we can offset demand and make this load curve more uniform, and that's our objective. We want to discharge Powerwall batteries during the peak grid load, and at other times, the homeowner will use the battery for clean backup power. A lesson we quickly learned as our virtual power plants grew to thousands of Powerwalls and tens of megawatts of power was that charging the batteries back up right after the peak would lead to our own peak, defeating the purpose. Of course, the solution is to control not only when the batteries discharge, but also when they charge, and to spread out the charging over a longer period of time.
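That lesson is easy to see in a toy calculation: discharging at the peak while spreading the recharge across several off-peak hours flattens the curve instead of creating a second peak. A hedged Python sketch with invented numbers:

```python
def shave(load_mw, battery_mwh, peak_hour, recharge_hours):
    """Discharge the full battery at the peak hour and spread the recharge
    evenly over the given off-peak hours (1-hour steps, so MW == MWh)."""
    shaved = list(load_mw)
    shaved[peak_hour] -= battery_mwh           # fleet discharges at the peak
    per_hour = battery_mwh / len(recharge_hours)
    for h in recharge_hours:                   # spread the recharge out
        shaved[h] += per_hour
    return shaved

load = [50, 45, 60, 90, 70, 55]                # grid load in MW, toy data
result = shave(load, battery_mwh=20, peak_hour=3, recharge_hours=[0, 1, 5])
```

Total energy is conserved (the batteries must be recharged), but the maximum of the shaved curve is lower than the original peak, which is the whole objective.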

This is what we're trying to accomplish, this picture, but in reality, we don't have the complete picture. There's uncertainty. It's noon, and we're trying to predict whether or not there's going to be a peak, and we only want to discharge batteries if there's a high likelihood of a peak. Once we've decided to discharge batteries to avoid the peak, how do we control them? I want to be very clear that we only control Powerwalls that are enrolled in specific virtual power plant programs. We don't arbitrarily control Powerwalls that aren't enrolled in these programs, so not every customer has this feature.

As Percy [Link] mentioned, the grid's not designed to interact with a whole bunch of small players. We need to aggregate these Powerwalls to look more like a traditional grid asset, something like a large steam turbine. Typically, we do this by having hierarchical aggregations that are a virtual representation in cloud software. The first level is a digital twin representing an individual site, so there'll be a house with a Powerwall. The next level might be organized by electrical topology, something like a substation, or it could be by geography, something like a county. The next level can again be a physical grouping, like an electrical interconnection, or it might be logical, like sites with a battery and sites with a battery plus solar that we want to control or optimize differently.

All of these sites come together to form the top level of the virtual power plant, meaning we can query the aggregate of thousands of Powerwalls as fast as we can query a single Powerwall and use this aggregate to inform our global optimization. It's easy to think of the virtual power plant as uniform, but the reality is more like this. There's a diversity of installations and home loads; some homes have one battery, some have two or three. The batteries are not all fully charged; some might be half full or close to empty, depending on home loads, time of day, solar production on that day, and the mode of operation.

There's also uncertainty in communication with these sites over the internet as some of them may be temporarily offline. Finally, there's the asset management problem of new sites coming online regularly, firmware being non-uniform in terms of its capabilities across the whole fleet and hardware being upgraded and replaced over time.

It's really critical to represent this uncertainty in the data model and in the business logic. We want to say things like, there's 10 megawatt-hours of energy available, but only 95% of the sites we expect to be reporting have reported. It's really only the consumer of the data that can decide how to interpret this uncertainty based on the local context of that service. One way we manage this uncertainty is through a site-level abstraction. Even if the sites are heterogeneous, this edge computing platform provides site-level telemetry for things like power, frequency, and voltage that gives us a consistent abstraction in software.
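One way to sketch that data model is to carry the reporting coverage alongside the aggregate itself, leaving interpretation to the consumer. A hypothetical Python illustration (field names invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UncertainAggregate:
    energy_available_mwh: float
    sites_reporting: int
    sites_expected: int

    @property
    def coverage(self) -> float:
        # Fraction of expected sites that actually reported; the consumer
        # decides whether, say, 95% coverage is good enough for its purpose.
        return self.sites_reporting / self.sites_expected

def aggregate(reports: dict, sites_expected: int) -> UncertainAggregate:
    # reports maps site_id -> available energy (MWh) for sites that reported
    return UncertainAggregate(sum(reports.values()), len(reports), sites_expected)

# 95 of 100 expected sites have reported 0.1 MWh each (toy numbers)
agg = aggregate({f"s{i}": 0.1 for i in range(95)}, sites_expected=100)
```

A dispatch service might require high coverage before committing capacity to the market, while a monitoring dashboard might happily display a partial aggregate.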

Then another way is to aggregate the telemetry across the virtual power plant. People don't want to worry about controlling individual Powerwall batteries; they want to worry about discharging 10 megawatts from 5 p.m. to 6 p.m. in order to shave the peak. This is a really difficult engineering challenge, which is a combination of streaming telemetry and asset modeling. For modeling each site in software, this so-called digital twin, we represent each site with an actor. The actor manages state, like the latest reported telemetry from that battery, and executes a state machine, changing its behavior if the site is offline and telemetry is delayed. It also provides a convenient model for concurrency and computation.

The programmer worries about modeling an individual site in an actor, and then the Akka runtime handles scaling this to thousands or millions of sites, and you don't have to worry about that. It's a very powerful abstraction for IoT in particular, and we generally never worry about threads, or locks, or concurrency bugs. The higher-level aggregations are also represented by individual actors, and then actors maintain their relationships with other actors describing this physical or logical aggregation. Then the telemetry is aggregated by messaging up the hierarchy in-memory in near real-time, and how real-time the aggregate is at any level is really just a trade-off between messaging volume and latency.
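The actor-per-site model with in-memory aggregation up a hierarchy can be sketched in plain Python (this is a toy illustration, not Akka; in the real system the parent "call" would be an asynchronous, location-transparent message, and the runtime would distribute actors across a cluster):

```python
class AggregateActor:
    """Parent node: keeps the latest value per child and exposes the sum."""
    def __init__(self):
        self.latest = {}                 # child id -> latest reported power

    def on_child_update(self, child_id, power_kw):
        self.latest[child_id] = power_kw

    @property
    def total_power_kw(self):
        return sum(self.latest.values())

class SiteActor:
    """Digital twin of one site: holds its own state, messages its parent."""
    def __init__(self, site_id, parent):
        self.site_id = site_id
        self.parent = parent
        self.latest_power_kw = None

    def on_telemetry(self, power_kw):
        # Update local state, then push the change up the hierarchy.
        self.latest_power_kw = power_kw
        self.parent.on_child_update(self.site_id, power_kw)

fleet = AggregateActor()
sites = [SiteActor(f"site-{i}", fleet) for i in range(3)]
for s in sites:
    s.on_telemetry(5.0)                  # each site reports 5 kW
```

Querying `fleet.total_power_kw` at any time returns the current in-memory aggregate, without touching the individual sites; the trade-off between messaging volume and latency mentioned above would show up in how often `on_telemetry` fires.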

We can query at any node in this hierarchy to know the aggregate value at that location, or query the latest telemetry from an individual site. We can also navigate up and down the hierarchy from any point. The services that perform this real-time hierarchical aggregation run in an Akka cluster. Akka Cluster allows a set of pods with different roles to communicate with each other transparently. The first role is a set of linearly scalable pods that stream data off Kafka, and they use Akka Streams for backpressure, bounded resource constraints, and low-latency stream processing.

Then they message with a set of pods running all the actors in this virtual representation that I just described. When the stream processors read a message off Kafka for a particular site, they message to the actor representing that site simply using the site identifier. It doesn't matter where in the cluster that actor is running, the Akka runtime will transparently handle the delivery of that message. This is called location transparency. Then site actors message with their parents in a similar manner, all the way up that hierarchy.

There's also a set of API pods that serve client requests for site-level or aggregate telemetry, because they can query into the cluster in this same location-transparent way. It's this collection of services that provides the in-memory, near real-time aggregation of telemetry for thousands of Powerwalls. It's an architecture that provides great flexibility, especially when paired with Kubernetes to manage the pods, because the actors are just running on this substrate of compute. They're kind of running on the heap, if you will.

An individual pod can fail or be restarted, and the actors that were on that pod will simply migrate to another; the runtime handles this, and the programmer doesn't have to worry about it. The cluster can also be scaled up or down, and the actors will rebalance across the cluster. Actors can recover their state automatically using Akka Persistence. In this case, we don't actually need to use Akka Persistence, because the actor can just rediscover its relationships, as well as the latest state, when the next message from the battery arrives within a few seconds.

To conclude this section, after aggregating telemetry to know the capacity that's available in the virtual power plant, let's look at how the batteries are actually controlled. The first step is taking past measurements, forecasting, and deciding how many megawatts to discharge if we are going to hit a peak. At a high level, this loop of measure, forecast, optimize, and control is basically running continuously. The control part of this loop is true closed-loop control.

Once an aggregate control setpoint has been determined, we continuously monitor the disaggregate telemetry from every single site to see how it responds, and we adjust the setpoint for the individual sites to minimize error. We can take a look at how this works. The Autobidder platform that Percy described may decide to control the whole fleet. To give a sense of scale, this might be enough megawatts to offset the need to build a new natural gas peaker plant. Or we might just decide to control a subset of the hierarchy depending on the objective.
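The closed-loop part of measure, forecast, optimize, and control can be illustrated with a small Python sketch. This is not Tesla's controller; the 90% plant response and the proportional gain are made-up numbers, chosen only to show how repeatedly measuring the fleet's actual output and adjusting the commanded setpoint drives the error toward zero:

```python
def plant_response(setpoint_mw):
    # Hypothetical fleet: offline or nearly empty batteries mean only
    # 90% of the commanded power actually shows up.
    return 0.9 * setpoint_mw

def adjust_setpoint(target_mw, measured_mw, setpoint_mw, gain=0.5):
    # Proportional correction: nudge the command toward closing the error.
    return setpoint_mw + gain * (target_mw - measured_mw)

target_mw = 10.0
setpoint_mw = target_mw
for _ in range(50):                      # measure -> adjust, continuously
    measured_mw = plant_response(setpoint_mw)
    setpoint_mw = adjust_setpoint(target_mw, measured_mw, setpoint_mw)
```

After the loop, the commanded setpoint has settled above the target (around 11.1 MW here) so that the delivered power matches the 10 MW goal despite the shortfall, which is the essence of closing the loop on disaggregate telemetry.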

The control service that I mentioned earlier dynamically resolves the individual sites under this target by querying the assets service, and this is because the sites can change over time. New sites are installed, the virtual hierarchy might be modified, or the properties of an individual site might change, maybe you add a second battery. The control service queries the battery telemetry at every site, potentially thousands of sites using the in-memory aggregation that I just discussed to decide how to discharge the battery at each site. There's no point discharging a battery that's almost empty. You can think of this as somewhat similar to a database query planner, basically trying to plan the execution.
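The query-planner analogy, deciding how to split an aggregate target across sites based on their telemetry, can be sketched as a simple proportional allocation (a hypothetical policy for illustration, not the production algorithm; the 1 kWh eligibility floor is an invented threshold):

```python
def plan_discharge(total_kw, available_kwh, min_kwh=1.0):
    """Split an aggregate discharge target across sites in proportion to
    their available energy, skipping batteries that are nearly empty."""
    eligible = {s: e for s, e in available_kwh.items() if e >= min_kwh}
    total = sum(eligible.values())
    if total == 0:
        return {}
    return {s: total_kw * e / total for s, e in eligible.items()}

# Site "c" is almost empty, so there's no point discharging it.
plan = plan_discharge(100.0, {"a": 10.0, "b": 30.0, "c": 0.2})
```

Here site "b" carries three times the load of site "a" because it has three times the energy available, and "c" is excluded entirely.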

The control service then sends a message to each site with a discharge setpoint and a timeframe, and it will keep retrying until it gets an acknowledgment from the site, or the timeframe has elapsed. Because these logical aggregations of batteries are so large, we stream over these huge collections using Akka Streams to provide bounded resource constraints in all of the steps that I've just described. That's resolving the sites, reading all of the telemetry, and then sending all the control setpoints.
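The retry-until-acknowledged-or-expired delivery described above can be sketched as follows. This is a simplified illustration (the real system streams these sends with Akka Streams); the clock and sleep are injectable so the demo can run with fake time:

```python
import time

def send_with_retry(send, deadline, clock=time.monotonic, sleep=time.sleep,
                    retry_interval=1.0):
    """Resend a setpoint until the site acknowledges or the timeframe ends."""
    while clock() < deadline:
        if send():                       # send() returns True on acknowledgment
            return True
        sleep(retry_interval)
    return False                         # timeframe elapsed without an ack

# Demo with a fake clock: the site only acknowledges on the third attempt.
attempts = []
fake_time = [0.0]
ok = send_with_retry(
    send=lambda: attempts.append(1) or len(attempts) >= 3,
    deadline=10.0,
    clock=lambda: fake_time[0],
    sleep=lambda s: fake_time.__setitem__(0, fake_time[0] + s),
)
```

If the deadline had been shorter than two retry intervals, `send_with_retry` would have returned `False` and the site would simply be left out of that control window.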

Huge aggregations demand different APIs and data processing patterns. You can't just go build typical CRUD microservices; that's not going to work. You need streaming semantics for processing large collections with low latency and bounded resource constraints. What we really need is a runtime for modeling stateful entities that supports location transparency, concurrency, scaling, and resilience.
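As a rough illustration of the streaming-over-collections point, a lazy generator in Python keeps only one site's data in memory at a time instead of materializing the whole fleet (the function names here are hypothetical; the real system uses Akka Streams, which adds backpressure on top of this laziness):

```python
def stream_setpoints(site_ids, telemetry_for, setpoint_for):
    """Process a huge collection lazily: one site in memory at a time,
    rather than loading telemetry for the entire fleet up front."""
    for sid in site_ids:
        yield sid, setpoint_for(telemetry_for(sid))

# Demo with stand-in functions: every site reports 5 kW of headroom,
# and the (made-up) policy commands half of it.
result = list(stream_setpoints(
    ["site-1", "site-2"],
    telemetry_for=lambda sid: 5.0,
    setpoint_for=lambda kw: kw / 2,
))
```

Because the pipeline is a generator, memory use is bounded by one element regardless of whether there are two sites or two million.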

Uncertainty is inherent in distributed IoT systems, so we need to just embrace this in the data model, in the business logic, and even in the customer experience rather than trying to escape it. Representing physical and virtual relationships among IoT devices, especially as they change over time is the hardest problem in IoT, trust me, but essential for creating a great product.

Direct control based on a central objective doesn't account for local needs, and this creates a kind of tension. Imagine a storm is approaching close to a peak. The global objective wants to discharge these batteries to avoid the peak, but, of course, the homeowner wants a full battery in case the power goes out. This leads to the final part of our presentation, the co-optimized virtual power plant.

Market-Participation Virtual Power Plant

Link: Just to review where we are: so far we've built on the fundamental platform to, first, optimize a single big battery to participate in an electricity market, and, second, aggregate, optimize, and control thousands of batteries to meet a central goal. In this last section, like Colin [Breck] said, we're again going to aggregate, optimize, and control thousands of batteries, but this time not just for a global goal. We're going to co-optimize local and global objectives.

Whereas the peak-shaving virtual power plant that Colin [Breck] just described optimized a central objective and passed the control decisions downward to the sites, the market virtual power plant distributes the optimization itself across the sites and the cloud. The sites, in this case, actually participate in the control decisions.

This distributed optimization is only possible because Tesla builds its own hardware and has full control over firmware and software. This enables quick iteration across the local and central intelligence and how they relate to each other, and this collaboration is cross-team rather than cross-company.

When we say that this virtual power plant co-optimizes local and global objectives, what do we mean? Let's take a home that is not part of a virtual power plant: its solar generation and its electricity consumption combine into a net load, which is the load that the utility sees. With a Powerwall home battery, the Powerwall can charge during excess solar generation and discharge during high load, thanks to the local intelligence on the device. The goal of this would be either to minimize the customer's bill or to maximize how much of their own solar production they're using. This is local optimization.

What does it look like to co-optimize local and global objectives? One way to do it is for the local optimization to consider information about the aggregate goal, like market prices indicating the real-time balancing needs of the grid. In this example, negative prices in the night, perhaps caused by wind over-generation, might cause the battery to charge earlier, and a high price in the afternoon, caused maybe by unexpectedly high demand, prompts the battery to discharge rather than waiting to fully offset the evening load like it would have. Note that this is all while following local regulations around discharging to the grid.
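A deliberately naive Python sketch of price-aware local optimization: pick the cheapest hours to charge (which naturally selects negative prices) and the most expensive hours to discharge. This is not the on-device algorithm, and the capacity and power figures are illustrative placeholders, not Powerwall specifications:

```python
def plan_battery(prices, capacity_kwh, power_kw):
    """Greedy sketch: charge in the cheapest hours, discharge in the
    priciest ones, within the battery's energy capacity."""
    n = int(capacity_kwh // power_kw)    # hours needed for a full cycle
    if n == 0:
        return ["idle"] * len(prices)
    order = sorted(range(len(prices)), key=lambda h: prices[h])
    charge_hours = set(order[:n])        # cheapest (possibly negative) hours
    discharge_hours = set(order[-n:])    # most expensive hours
    return ["charge" if h in charge_hours
            else "discharge" if h in discharge_hours
            else "idle"
            for h in range(len(prices))]

# Negative overnight prices, a pricey afternoon peak ($/MWh, invented).
plan = plan_battery(prices=[-5, 2, 3, 10, 40, 35],
                    capacity_kwh=9.0, power_kw=4.5)
```

With these prices the battery charges during the two cheapest hours, including the negative-price one, and discharges into the two most expensive, which is exactly the behavior shift the price signal is meant to induce.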

In our co-optimized virtual power plant, Autobidder generates a time series of price forecasts every 15 minutes, and the Tesla Energy platform's control component distributes those forecasts to the sites. The local optimization then runs, making a plan for the battery given both the local and global objectives. Then the sites communicate that plan back to the Tesla Energy platform, which ingests and aggregates it using the same framework that ingests and aggregates telemetry. Autobidder then uses the aggregate plans to decide what to bid.

This distributed algorithm has a couple of big advantages. One is scalability. We're taking advantage of edge computing power here, and we're not solving one huge optimization problem over all sites. As more sites join the aggregation, we don't have to worry about our central optimization falling over.

Another big advantage is resilience to the inevitable intermittency of communication. When sites go offline for short or moderate amounts of time, they have the last received version of the global price time series, and they can continue to co-optimize using the best available estimate of the global objective. Then, if the sites are offline for longer than the length of that price time series, they just revert to purely local optimization. This is really reasonable behavior in the case of degrading connectivity; it's still creating local value for the local site.
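This graceful-degradation rule is simple enough to sketch directly (hypothetical function and field names; times are in seconds and the horizon in hours, matching the idea that cached prices stay useful until the forecast they cover runs out):

```python
def local_objective(now, received_at, horizon_hours, last_prices):
    """Pick the objective a site can honor given its connectivity.

    Cached prices still within the forecast horizon -> keep co-optimizing
    against the global objective; anything older -> purely local."""
    age_hours = (now - received_at) / 3600.0
    if age_hours < horizon_hours and last_prices is not None:
        return "co-optimize", last_prices
    return "local-only", None
```

A site two hours into an outage keeps co-optimizing on its cached 24-hour price series; two days in, it falls back to the purely local optimization described earlier.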

Then from the other perspective of the server, the telemetry aggregation accounts for offline sites out of the box. If sites haven't reported signals in a certain amount of time, they're excluded from the aggregate. Then Autobidder is able to bid conservatively and assume that offline sites are not available to participate in market bids.
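The server-side counterpart, excluding stale sites from the aggregate so bids stay conservative, amounts to a staleness filter over the latest reports (a minimal sketch with invented names; the 60-second cutoff is an arbitrary example):

```python
def aggregate_available_kw(reports, now, max_age_s=60.0):
    """Sum available power only over sites that reported recently.

    `reports` maps site id -> (available_kw, last_report_timestamp).
    Silent sites drop out of the aggregate, so anything bid on top of
    it is automatically conservative."""
    return sum(kw for kw, ts in reports.values() if now - ts <= max_age_s)

# Site "a" reported 20 s ago and counts; "b" has been silent for 110 s.
available = aggregate_available_kw(
    {"a": (5.0, 100.0), "b": (7.0, 10.0)}, now=120.0)
```

The "out of the box" behavior in the transcript falls out naturally: no special offline handling is needed beyond timestamping every report.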

Tesla's unique vertical integration of hardware, firmware, and software enables this distributed algorithm. The vertical integration lets us build a better overall solution. This distributed algorithm makes the virtual power plant more resilient: devices are able to behave in a reasonable way during the inevitable communications failures of a distributed system.

This algorithm is only possible because of the high-quality, extensible Tesla energy platform that embraces uncertainty and models reality. At the same time, the algorithms help the software platform. The algorithms enhance the overall value of the product.

In our journey building the Tesla energy virtual power plant, we found it very true that while the algorithms are obviously important to the virtual power plant's success, the architecture and reliability of the overall system are the key to the solution.

Breck: It's this system that allows us to provide reliable power to people who have never had it before.

Link: Balance renewables on the grid.

Breck: Provide flexible energy solutions for disaster relief.

Link: Build highly integrated products and services that deliver a superior customer experience.

Breck: We're working on a mix of the most interesting and challenging problems in distributed computing, as well as some of the most challenging and interesting problems in distributed renewable energy. We're hiring if you want to work on these challenging important problems with us, of course.

Link: Equally importantly, software has the potential to address many of the most pressing problems in the world, from renewable energy and climate change to food and agriculture to cancer and infectious disease research.

Breck: Let's take our talents in software engineering and work on the most important and lasting problems that we can find.



Recorded at:

Mar 23, 2020