API Architecture for Demand Response Aggregators: Patterns and Pitfalls -- Derapi

Demand response aggregators sit in an interesting architectural position. On the upstream side, you are receiving dispatch signals from utilities or ISOs via OpenADR or direct API integrations. On the downstream side, you are sending control commands to thousands of customer-side DER devices through a zoo of device APIs, protocols, and connectivity patterns. The API architecture that connects these two sides determines whether your platform can actually perform when a real curtailment event fires.

This piece covers the architectural patterns that work for aggregator platforms and the failure modes we have seen in platforms that did not think through the architecture early enough.

The core architectural challenge

Demand response events have two defining characteristics that shape the architecture: they are time-critical and they require fan-out.

Time-critical: a utility curtailment signal arrives with a start time that may be minutes away. Your platform needs to process the signal, translate it into device-specific commands, and dispatch those commands before the event window opens. Latency anywhere in this pipeline costs you compliance margin.

Fan-out: a single curtailment event may need to dispatch commands to 5,000 customer devices simultaneously. Your dispatch layer needs to handle that concurrency without queuing delays that cause some devices to receive commands late or not at all.

These two requirements, low latency and high concurrency, point to the same architectural conclusion: synchronous, request-response dispatch cannot keep up once the fleet grows, because every device waits on the calls ahead of it. You need an asynchronous, event-driven dispatch pipeline.

Pattern 1: Event-driven dispatch pipeline

The architecture that handles demand response fan-out reliably uses a message queue as the backbone. When an event signal arrives from the VTN (the virtual top node, OpenADR's term for the utility- or ISO-side server), the pipeline runs through six steps:

1. The platform parses the OpenADR event and translates it into a normalized internal event format.
2. It queries the enrollment database to find all devices enrolled in the active program.
3. It publishes one dispatch message per device to a message queue.
4. Worker processes consume the queue and send device-specific API commands.
5. Results (success or failure per device) are written to a results store.
6. A reporting pipeline aggregates the results and sends compliance reports back to the VTN.

The queue is the key element. Publishing 5,000 messages to a queue is fast, typically a matter of milliseconds. A pool of workers then processes those messages in parallel, with each worker handling a subset of devices. Total dispatch time is roughly the device count divided by the number of workers, multiplied by the per-command latency, so you shorten it by adding workers instead of being bound by fleet size.
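The fan-out shape can be sketched in a few lines. This is a minimal illustration, not a production design: it uses an in-process asyncio.Queue as a stand-in for a durable message broker, and send_command, worker, and dispatch_event are hypothetical names invented for the example. It assumes Python 3.9 or later.

```python
import asyncio


async def send_command(device_id: str, payload: dict) -> bool:
    """Hypothetical stand-in for a vendor API call; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return True


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Consume dispatch messages until cancelled, recording success or failure per device."""
    while True:
        device_id, payload = await queue.get()
        try:
            results[device_id] = await send_command(device_id, payload)
        except Exception:
            # A failed call is recorded, not raised, so one bad device cannot stall the pool.
            results[device_id] = False
        finally:
            queue.task_done()


async def dispatch_event(device_ids: list[str], payload: dict, num_workers: int = 50) -> dict:
    """Publish one message per device, then let a worker pool drain the queue."""
    queue: asyncio.Queue = asyncio.Queue()
    results: dict[str, bool] = {}
    for device_id in device_ids:
        queue.put_nowait((device_id, payload))

    workers = [asyncio.create_task(worker(queue, results)) for _ in range(num_workers)]
    await queue.join()  # returns once every message has been marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling asyncio.run(dispatch_event(device_ids, payload, num_workers=100)) returns a per-device success map that can feed the results store. In a real deployment the queue would normally live in a separate broker so that messages survive a worker crash, and retries and per-vendor rate limits would need handling that this sketch omits.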

Pattern 2: Pre-computed device command templates

One of the latency bottlenecks in demand response dispatch is the translation step: taking a normalized curtailment instruction and converting it into a device-specific API call for each vendor's format. If this translation happens at dispatch time, it adds latency and computation to the critical path.

A better pattern: pre-compute command templates at enrollment time. When a device is enrolled in a demand response program, the platform computes the device-specific command payload for each possible event type (50 percent curtailment, full curtailment, test event) and stores the pre-computed payloads alongside the device record. At dispatch time, the worker retrieves the pre-computed payload and sends it directly.

This trades storage for latency — usually a good trade when event response time matters.
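A rough sketch of the idea, under stated assumptions: the payload shapes, the thermostat offset rule, and the names EVENT_TYPES, TRANSLATORS, enroll_device, and payload_for are all hypothetical and do not correspond to any real vendor API. The point is only that translation runs once at enrollment and dispatch becomes a lookup.

```python
import json

# Event types the program can issue, mapped to a curtailment percentage (illustrative values).
EVENT_TYPES = {"half_curtailment": 50, "full_curtailment": 100, "test_event": 0}


def thermostat_payload(percent: int) -> dict:
    # Hypothetical vendor format: raise the cooling setpoint by up to 4 degrees F.
    return {"command": "setpoint_offset", "offset_f": round(4 * percent / 100, 1)}


def battery_payload(percent: int) -> dict:
    # Hypothetical vendor format: discharge at a percentage of rated power.
    return {"command": "discharge", "power_pct": percent}


TRANSLATORS = {"thermostat": thermostat_payload, "battery": battery_payload}


def enroll_device(store: dict, device_id: str, vendor_type: str) -> None:
    """Run the vendor translation once, at enrollment, for every event type."""
    translate = TRANSLATORS[vendor_type]
    store[device_id] = {
        "vendor_type": vendor_type,
        "payloads": {name: json.dumps(translate(pct)) for name, pct in EVENT_TYPES.items()},
    }


def payload_for(store: dict, device_id: str, event_type: str) -> str:
    """Dispatch-time path: a dictionary lookup, with no translation on the critical path."""
    return store[device_id]["payloads"][event_type]
```

One consequence of this design is that stored payloads go stale whenever a vendor changes its command format or a device's settings change, so the platform needs a way to recompute them outside of event windows.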

The settlement reporting architecture

Demand response programs pay on reported performance, not on dispatch attempts. Your architecture needs to collect and aggregate telemetry from enrolled devices during and after events to generate the settlement reports that determine your customers' DR payments.

The telemetry collection layer is often an afterthought in aggregator architectures, and it shows up as a problem during the first settlement period. Device telemetry arrives at different rates, in different formats, with different time resolutions. Solar inverters might report at 15-minute intervals; smart thermostats might report every 5 minutes; some devices might batch-report after the event ends rather than streaming during it.

Device type	Typical report cadence	Data format
Solar inverter (cloud API)	15-min intervals	Vendor REST API
Smart thermostat	5-min intervals	Vendor webhook or poll
EV charger (OCPP)	Per-session, configurable	OCPP MeterValues
Battery storage (IEEE 2030.5)	15-min intervals	IEEE 2030.5 MirrorMeterReading

Your settlement pipeline needs to normalize this heterogeneous telemetry into a unified baseline-vs-actual comparison for each device, aggregated to the program's required reporting granularity. Build the telemetry ingestion architecture before you build the dispatch layer — settlement is what you get paid for.
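As an illustration of the normalization step, the sketch below averages readings of different cadence into fixed 15-minute intervals and compares them against a baseline. It assumes readings are (timestamp, kW) pairs of average power, that the interval length divides 60 evenly, and that the baseline is already computed; the function names are hypothetical, and real baseline methodologies are defined by each program.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def floor_to_interval(ts: datetime, minutes: int = 15) -> datetime:
    """Round a timestamp down to the start of its reporting interval."""
    return ts - timedelta(
        minutes=ts.minute % minutes, seconds=ts.second, microseconds=ts.microsecond
    )


def normalize(readings, minutes: int = 15) -> dict:
    """Average (timestamp, kW) readings of any cadence into fixed-length intervals."""
    buckets = defaultdict(list)
    for ts, kw in readings:
        buckets[floor_to_interval(ts, minutes)].append(kw)
    return {start: sum(values) / len(values) for start, values in sorted(buckets.items())}


def curtailment(baseline_kw: dict, actual_kw: dict) -> dict:
    """Baseline minus actual per interval; missing telemetry is reported as None, not guessed."""
    return {
        start: (baseline_kw[start] - actual_kw[start]) if start in actual_kw else None
        for start in baseline_kw
    }
```

Keeping missing intervals explicit (None here) makes telemetry gaps visible in the settlement report instead of silently counting them as zero curtailment.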

The enrollment state management problem

Aggregator platforms maintain enrollment state for every device in every program they manage. This state is surprisingly complex: a device can be enrolled in multiple programs simultaneously, each with different curtailment logic; enrollment can be suspended (device temporarily unavailable), overridden (customer opted out of a specific event), or terminated. Events can happen when some enrolled devices are offline.

Enrollment state bugs are the hardest category of aggregator platform bugs to diagnose, because they tend to produce incorrect dispatch behavior that only manifests during live events, not in normal testing.

The architectural mitigation is to treat enrollment state as an event-sourced store rather than a mutable record. Every change to enrollment state (enrollment, suspension, opt-out, termination) is recorded as an immutable event with a timestamp. The current state is derived by replaying the event log. This makes it possible to reconstruct exactly what the enrollment state was at any past moment — which is essential for investigating why a device did or did not receive a dispatch command during a specific event.
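A minimal sketch of the replay idea follows. The event kinds, the "resumed" transition, and the names EnrollmentEvent, TRANSITIONS, and state_at are assumptions made for the example; per-event opt-outs are left out for brevity, and a real store would persist the log and likely snapshot it instead of replaying from the start on every query.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a recorded event is never mutated
class EnrollmentEvent:
    device_id: str
    program_id: str
    kind: str  # "enrolled", "suspended", "resumed", or "terminated"
    at: datetime


# State that each event kind leaves the enrollment in (illustrative set of kinds).
TRANSITIONS = {
    "enrolled": "active",
    "resumed": "active",
    "suspended": "suspended",
    "terminated": "terminated",
}


def state_at(log, device_id: str, program_id: str, as_of: datetime) -> str:
    """Derive the enrollment state at a moment in time by replaying the log up to it."""
    state = "not_enrolled"
    for event in sorted(log, key=lambda e: e.at):
        if event.at > as_of:
            break
        if event.device_id == device_id and event.program_id == program_id:
            state = TRANSITIONS[event.kind]
    return state
```

During a settlement dispute, calling state_at with the event's start time answers the question of whether the device was eligible for dispatch at that moment, independent of whatever the enrollment looks like today.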

The common failure modes

The patterns above are not hypothetical — they are responses to specific failure modes we have observed in aggregator platforms that did not implement them.

Synchronous dispatch loops that iterate over enrolled devices sequentially, sending one API call at a time, creating dispatch latency that grows linearly with fleet size. A platform that dispatches 500 devices sequentially at 100ms per API call takes 50 seconds to complete dispatch. If the event window starts in 30 seconds, only the first 300 devices have been reached by then, and the remaining 200 (40 percent of the fleet) miss the event start.
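The arithmetic behind that example is easy to check, and the same few lines show what a worker pool changes. The function names are hypothetical, times are in integer milliseconds to avoid floating-point rounding, and the pooled figure is an idealized bound that ignores vendor rate limits and retries.

```python
import math


def sequential_dispatch_seconds(devices: int, call_ms: int) -> float:
    """Total time when commands are sent one after another."""
    return devices * call_ms / 1000


def devices_missed(devices: int, call_ms: int, lead_ms: int) -> int:
    """Devices not yet reached when the event window opens, under sequential dispatch."""
    reached = min(devices, lead_ms // call_ms)
    return devices - reached


def pooled_dispatch_seconds(devices: int, call_ms: int, workers: int) -> float:
    """Idealized time with a worker pool: each worker handles its share sequentially."""
    return math.ceil(devices / workers) * call_ms / 1000
```

For 500 devices at 100 ms per call with 30 seconds of lead time, sequential dispatch takes 50 seconds and leaves 200 devices unreached at event start; with a pool of 50 workers the idealized dispatch time drops to 1 second.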

Telemetry collection built as a post-event batch job rather than a streaming pipeline, creating situations where a device API outage during the settlement window causes missing telemetry that cannot be reconstructed retroactively.

Enrollment state stored as mutable records without audit history, making it impossible to explain why a device did not receive a dispatch during an event that occurred three months ago and is now in settlement dispute.

The demand response aggregator market is growing as grid operators rely more on flexible demand to balance variable renewable generation. The platforms that will scale into that growth are the ones with the underlying architecture to handle thousands of enrolled devices with sub-minute dispatch latency and reliable telemetry collection. Build for that architecture early, even if your current enrollment count is small.