Customer-sited renewable resources offer tremendous promise for lowering costs, improving reliability, and reducing the carbon footprint of the electric grid. But operators struggle to take advantage of the various forms of flexibility these resources offer, having found it challenging to get multiple resources and increasingly complex systems to work well together. A distributed control environment, in which some assets are under direct control and others are not, adds further complexity.
A potential grid of the future is one in which a large number of participants act in harmony, aligning their capacities and constraints to make optimal use of decarbonized power—increasing system demand at times when clean energy is plentiful and reducing it at times when it is not. We see early steps toward this in many places, from big battery projects balancing rooftop solar in South Australia, to pilot projects such as Google’s rollout of systems which link data center power consumption to the carbon intensity of the energy supply. A common element is the role of software in enabling these diverse components to act in concert.
The internet giants themselves provide an example of how this might work at scale. As part of Google’s pioneering systems engineering team, we helped build and operate the internal software that provides real-time monitoring and control of systems with millions of participants, at sub-second latency and up to 99.999 percent reliability. This cloud-based model of distributed computing uses as many as millions of servers, working in parallel, to perform extremely complex tasks. The transition from a model in which work was done by a small number of hardened servers to one in which work can be performed across millions of unreliable machines carries some lessons for the potential evolution of the grid.
Reliability Is a Property of the System, Not the Components
One of the core challenges in transitioning to a grid that relies on distributed energy resources is trusting the performance of resources which may have unreliable connectivity and may not be under direct operational control. This is a common design challenge shared with today’s large-scale software systems, particularly ones that use a microservices architecture, where work may be distributed across many owners, including third parties.
This distributed computing approach relies on system-level design to mitigate the failures of individual components. It originated with Google’s design philosophy of reliable software on unreliable hardware, which replaced mainframe-style hardened enterprise servers with large numbers of cheap commodity servers (often so cheap that they were built out of components with known hardware flaws). At Uber, for example, systems challenges included maintaining resilient communications with unreliable endpoints (drivers, who might go into a garage or through a tunnel). At SpaceX, flight system telemetry must run autonomously, based on artificial intelligence and machine learning estimates of state, if connectivity is lost. Some of these techniques can be brought to bear in the grid space, providing tools for managing the next generation of grid challenges.
The transition from managing a system with hundreds of nodes to one with thousands or millions isn’t simply a more complex management problem; it’s a fundamentally different one, and it can take advantage of the law of large numbers. A single rooftop solar array without a production meter might backfeed an unknown amount of power into the grid, but several hundred arrays observed over a period of months or years produce an output curve that correlates consistently with capacity, weather, and modeled demand. Likewise, several dozen behind-the-meter demand resources might produce a complex set of behaviors, but thousands can produce a stable, controllable virtual power plant capable of bidding into markets which support it, as in Europe and Texas.
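The statistical intuition behind this can be shown in a few lines of Python. This is a minimal sketch, not a forecasting model: it assumes every resource has the same mean output and the same standard deviation, and the function name, the figures, and the `rho` correlation parameter are all illustrative assumptions rather than measured values.

```python
import math


def aggregate_cv(n, mean_kw, std_kw, rho=0.0):
    """Coefficient of variation (std / mean) of the summed output of n
    identical resources whose outputs have pairwise correlation rho.

    Hypothetical helper: assumes equal means and equal standard deviations.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total_mean = n * mean_kw
    # Var(sum) = n * sigma^2 + n * (n - 1) * rho * sigma^2
    total_var = n * std_kw ** 2 * (1 + (n - 1) * rho)
    return math.sqrt(total_var) / total_mean


# Illustrative numbers only: 5 kW average output, 2 kW standard deviation.
single = aggregate_cv(1, mean_kw=5.0, std_kw=2.0)
fleet = aggregate_cv(400, mean_kw=5.0, std_kw=2.0)
correlated = aggregate_cv(400, mean_kw=5.0, std_kw=2.0, rho=0.25)
```

Under these assumptions a single array varies by 40 percent of its mean, while 400 independent arrays vary by 2 percent. The `correlated` case is a reminder that the benefit shrinks when resources move together, as solar arrays under the same cloud bank do, which is why the aggregate curve still has to be modeled against weather.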
Centralized Coordination, Decentralized Control
A foundational requirement for reliably managing workload across large numbers of small workers is a hierarchical control model, which balances central coordination with decentralized, semi-autonomous control. In a distributed systems software model, the role of central coordinator is played by software load balancing, which assigns workload in real time across available resources (servers or data centers), while optimizing for costs and for constraints (such as maintaining sufficient capacity and managing network bandwidth).
In this model, the assigned resources can also use the same technology to balance work across multiple internal participants—similar to the relationship between a microgrid and a distribution grid, in which the microgrid can manage resources locally, but act as a single entity on the larger grid. Work is also assigned asynchronously. For example, an Uber driver can still deliver passengers even if they are only occasionally connected to the dispatch system to receive work assignments.
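The central assignment step described above can be sketched as a simple cheapest-first allocation. This is an illustrative simplification, not how any production load balancer or grid dispatch system works: the function name, the resource tuples, and the `reserve` fraction (spare capacity held back on each resource) are all assumptions made for the example.

```python
def assign_load(demand_kw, resources, reserve=0.0):
    """Greedy cheapest-first assignment of demand across resources.

    resources: list of (name, capacity_kw, cost) tuples (hypothetical format).
    reserve:   fraction of each resource's capacity kept as headroom.
    Returns (allocation dict, unserved demand).
    """
    allocation = {}
    remaining = demand_kw
    # Visit resources from cheapest to most expensive.
    for name, capacity_kw, _cost in sorted(resources, key=lambda r: r[2]):
        if remaining <= 0:
            break
        usable = capacity_kw * (1.0 - reserve)
        share = min(usable, remaining)
        if share > 0:
            allocation[name] = share
            remaining -= share
    return allocation, max(remaining, 0.0)


# Illustrative fleet: names, capacities, and costs are made up.
fleet = [("battery_a", 50, 10), ("solar_b", 30, 5), ("genset_c", 40, 20)]
allocation, unserved = assign_load(70, fleet)
```

Here the cheapest resource is filled first and the next cheapest covers the rest. Because each entry in `resources` only needs to report a capacity and a cost, an entry could itself be an aggregate, such as a microgrid running the same routine over its own internal devices.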
The aspects of control that must be centralized govern the goals and the behavior of the system at scale. Regulating up or down happens at the level of an individual device, but shaving coincident peaks is a system-level goal. System-level control also needs to account for system-level dynamics such as lockstep effects and feedback loops. For example, two batteries responding to a price signal might be helpful, while 2,000 might pose a big problem.
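One common way distributed software systems soften lockstep behavior is to add a random delay, or jitter, before each participant acts. The sketch below applies that idea to the battery example; it is an assumption about how such a mitigation might look, not a description of any deployed grid control scheme, and the device count, power rating, and window length are illustrative.

```python
import random
from collections import Counter


def peak_step_kw(n_devices, power_kw, jitter_window_s, seed=0):
    """Largest load step (kW) occurring within any single second when
    n_devices respond to the same signal.

    jitter_window_s: whole number of seconds over which each device
    independently picks a random start time; 0 means pure lockstep.
    """
    if n_devices == 0:
        return 0
    if jitter_window_s <= 0:
        # Every device switches in the same instant.
        return n_devices * power_kw
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    starts = Counter(rng.randrange(jitter_window_s) for _ in range(n_devices))
    return max(starts.values()) * power_kw


lockstep = peak_step_kw(2000, power_kw=5, jitter_window_s=0)
staggered = peak_step_kw(2000, power_kw=5, jitter_window_s=600)
```

With no jitter, all 2,000 batteries produce a single 10,000 kW step. Spreading their start times over ten minutes turns that into a ramp, with a far smaller change in any one second. Choosing the window is itself a system-level decision, since it trades response speed against smoothness.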
Incremental Steps Toward Dynamic System Operation
The core lesson of distributed computing is that the whole can be greater than the sum of its parts. A system composed of five unreliable devices is unreliable; a well-designed system composed of 500 or 5,000 can operate at near-perfect levels of reliability. Comprehensive, synchronous control is not required to achieve this goal—but some form of system-level coordination is required to guide participants into mutually reinforcing behavior, and to avoid negative systemic effects such as feedback loops.
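The arithmetic behind this claim can be sketched with a binomial model: if the system only needs some fraction of its devices to respond, the chance of meeting that threshold rises sharply with fleet size. The example assumes each device is available independently with the same probability, which is a strong assumption (a shared outage or a common software fault would break it), and the 90 percent availability and 80 percent threshold are illustrative figures.

```python
import math


def prob_at_least(k, n, p):
    """Probability that at least k of n independent devices are available,
    where each device is available with probability p (0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    total = 0.0
    for i in range(k, n + 1):
        # Binomial probability computed in log space to avoid overflow
        # for large n.
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Need 80 percent of devices, each available 90 percent of the time.
small = prob_at_least(4, 5, 0.9)
large = prob_at_least(400, 500, 0.9)
```

Under these assumptions, five devices meet the threshold only about 92 percent of the time, while 500 devices meet it with a probability above 99.9999 percent. The design lever is the gap between what the system needs and what the fleet can supply on average, not the reliability of any one device.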
The other requirement for effecting system-level change is comprehensive visibility. In a small system, it’s possible to predict and directly observe the behavior of each participant, but in a system with many participants, operators need tools to aggregate information and provide insight at the system level. This enables them to understand whether changes—like adding new control systems—are having the desired effect. For today’s grid, this implies that the first step is adding visibility to current operations—visibility that needs to extend across both the current network and any future participants.