Dark launching real-time price alerts

Philip Starritt
7 min read · Sep 14, 2021

We have developed a Life Cycle Management (LCM) application for structured products. The LCM gives investors a transparent view of market and risk data, and exposing these insights helps our customers with their decision-making. One key insight is the daily price fluctuation.

The price of a structured product fluctuates throughout the day. It is important that investors can monitor the prices of their products and be notified when the daily fluctuation exceeds a threshold, allowing them to react to the ever-changing financial markets.

Photo by Maxim Hopman on Unsplash

In the LCM platform, users can create recurring daily price fluctuation alerts for a single product, a group of products, or all of the products in their portfolio. For example, a user should be notified immediately if product A, B, or C has a daily price fluctuation greater than 5%.

We decided to dark launch this feature due to its potential performance impact. As Martin Fowler describes it:

Dark launching a feature means taking a new or changed back-end behavior and calling it from existing users without the users being able to tell it’s being called. It’s done to assess the additional load and performance impacts upon the system before making a public announcement of the new capability.

Stock Market Connectivity — Consuming real-time price data

We consume near real-time structured product prices through our market data provider's HTTP/2-enabled Server-Sent Events (SSE) push API. For each product, a long-lived SSE stream is established, with a price pushed to us every five seconds.

cURL — Server-Sent Event streaming
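In application code, consuming one of these streams with WebClient might look roughly like the sketch below; the endpoint path and the String payload type are illustrative, not the provider's actual API.

```java
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.MediaType;
import org.springframework.http.codec.ServerSentEvent;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class PriceStreamClient {

    private final WebClient webClient;

    public PriceStreamClient(WebClient webClient) {
        this.webClient = webClient;
    }

    /** Opens a long-lived SSE stream of price events for a single product. */
    public Flux<ServerSentEvent<String>> streamPrices(String productId) {
        return webClient.get()
                .uri("/prices/{productId}/stream", productId) // hypothetical endpoint
                .accept(MediaType.TEXT_EVENT_STREAM)
                .retrieve()
                .bodyToFlux(new ParameterizedTypeReference<ServerSentEvent<String>>() {});
    }
}
```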

As the server supports HTTP/2, many streams can be multiplexed over a single TCP connection per origin. Using fewer connections offers several performance benefits, such as reducing TCP protocol overhead and memory usage and generally improving network utilization. The server indicates the maximum number of concurrent streams per TCP connection with the SETTINGS_MAX_CONCURRENT_STREAMS parameter, sent inside a SETTINGS frame.

Google Developers — An Introduction to HTTP/2

This poses several technical challenges:

  • Maintaining healthy connectivity of thousands of long-lived streams over HTTP/2.
  • Consumption of high throughput data. We need to consume thousands of concurrent streams, each with a price pushed every five seconds. Data is only consumed when each product's listed stock exchange is open.

High-Level Design

Given the high number of streams required and events consumed per second, we felt a natural fit was a non-blocking architecture based on Project Reactor and Netty. Reactor's operators and schedulers have a low memory footprint and can sustain high throughput rates, on the order of tens of millions of messages per second.

As the events are consumed from the SSE API, we publish them to a GCP PubSub Topic with a single subscription. Competing consumers are then able to consume from the subscription.
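One way to implement this publishing step is with Spring Cloud GCP's PubSubTemplate; a minimal sketch is shown below, with an illustrative topic name (package names vary by Spring Cloud GCP version).

```java
import com.google.cloud.spring.pubsub.core.PubSubTemplate;
import reactor.core.publisher.Flux;

public class PricePublisher {

    private static final String TOPIC = "price-events"; // illustrative topic name

    private final PubSubTemplate pubSubTemplate;

    public PricePublisher(PubSubTemplate pubSubTemplate) {
        this.pubSubTemplate = pubSubTemplate;
    }

    /** Forwards each consumed price event to the Pub/Sub topic. */
    public void publish(Flux<String> priceEvents) {
        priceEvents.subscribe(event -> pubSubTemplate.publish(TOPIC, event));
    }
}
```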

Google Cloud Architecture

Configuring Spring Boot WebClient for HTTP/2

In order to leverage the benefits of HTTP/2 multiplexing within Spring Boot WebClient, we must first configure the underlying HTTP client.

We chose WebClient (based on Reactor Netty) as our preferred HTTP client for its non-blocking I/O and high-throughput capabilities; many traditional clients would not cope well with the number of streams required. To enable HTTP/2, set the protocol to H2.

Spring Boot WebClient HTTP/2 example
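A minimal sketch of such a configuration, assuming Reactor Netty as the underlying client and an illustrative base URL:

```java
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.HttpProtocol;
import reactor.netty.http.client.HttpClient;

class WebClientConfig {

    WebClient priceStreamWebClient() {
        // H2 = HTTP/2 over TLS; many SSE streams can then share one TCP connection.
        HttpClient httpClient = HttpClient.create()
                .protocol(HttpProtocol.H2)
                .secure(); // default TLS settings; ALPN negotiates HTTP/2

        return WebClient.builder()
                .baseUrl("https://market-data.example.com") // illustrative base URL
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```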

Validating connections & streams

Application-level:

Each new stream is logged at debug level.

Opening new HTTP/2 streams

Container networking

Netstat is a Linux command-line networking tool that lists the container's established TCP connections. We use it to verify that the expected number of TCP connections are open.

Netstat: Active TCP connections

Above, we can verify that two streams are sharing one TCP connection.

Filtering & Caching

The ingested prices are used for a variety of features across the platform, so we decided to keep caching simple and used Flux's distinctUntilChanged() filter. This significantly reduced traffic by filtering out sequential duplicates. The keys (the last distinct price seen) are garbage collected when the subscriber is canceled.

Reactor Core: Flux#distinctUntilChanged()
Flux distinct until changed
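A small, self-contained illustration of the operator's behavior (the prices are made up):

```java
import java.math.BigDecimal;
import reactor.core.publisher.Flux;

class DistinctUntilChangedExample {

    /** Filters out sequential duplicate prices; only genuine changes flow downstream. */
    static Flux<BigDecimal> changesOnly(Flux<BigDecimal> prices) {
        return prices.distinctUntilChanged();
    }

    public static void main(String[] args) {
        changesOnly(Flux.just(
                        new BigDecimal("101.5"),
                        new BigDecimal("101.5"),  // dropped: same as the previous element
                        new BigDecimal("102.0"),
                        new BigDecimal("101.5"))) // kept: differs from the previous element
                .subscribe(System.out::println); // prints 101.5, 102.0, 101.5
    }
}
```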

As our platform grows, we will likely need to introduce more complex caching and a highly available in-memory cache such as GCP MemoryStore or Redis.

Deployment

In order to minimize the load on our system, we open one stream per product. This introduces complexity when designing the application for high availability. We considered the tradeoffs and complexity of a multi-instance solution, but at present it is appropriate and clean to deploy it as a single instance.

Kubernetes RollingUpdate

We currently deploy the application with the Kubernetes RollingUpdate strategy with a maxSurge of one. This results in the system temporarily opening duplicate product streams.

K8 RollingUpdate deployment strategy

During this time, GCP PubSub buffering ensures that the solution is resilient to bursts of traffic. Idempotent consumers simply discard duplicate events.
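One simple way to achieve that idempotency, sketched here with an in-memory set keyed on an assumed product-and-timestamp identifier (a unique database constraint or a TTL-bounded cache would work just as well):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class IdempotentPriceConsumer {

    // Keys of events that have already been processed. In practice a unique
    // database constraint or a TTL-bounded cache may be preferable to an unbounded set.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    void handle(String productId, String priceTimestamp) {
        String key = productId + "@" + priceTimestamp; // assumed de-duplication key
        if (!processed.add(key)) {
            return; // duplicate delivery (e.g. during a rolling update), so discard it
        }
        // ... continue processing the price event ...
    }
}
```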

Future Improvement

A future improvement would be to split the streams across multiple instances, so that if one instance dies, only a subset of streams is affected.

Price fluctuation engine

Competing consumers consume the price events from the GCP PubSub Subscription. In our case, we use Spring Cloud PubSub with a scalable multi-instance application.
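A minimal subscriber sketch using PubSubTemplate (the subscription name is illustrative and package names depend on the Spring Cloud GCP version):

```java
import com.google.cloud.spring.pubsub.core.PubSubTemplate;
import com.google.cloud.spring.pubsub.support.BasicAcknowledgeablePubsubMessage;

class PriceEventSubscriber {

    private static final String SUBSCRIPTION = "price-events-sub"; // illustrative name

    PriceEventSubscriber(PubSubTemplate pubSubTemplate) {
        pubSubTemplate.subscribe(SUBSCRIPTION, this::onMessage);
    }

    private void onMessage(BasicAcknowledgeablePubsubMessage message) {
        String payload = message.getPubsubMessage().getData().toStringUtf8();
        // ... parse the payload and evaluate price fluctuation alerts ...
        message.ack(); // acknowledge only after successful processing
    }
}
```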

Keeping subscribers healthy

In order to process the backlog efficiently, we must ensure that subscribers keep up with the inflow of messages. We monitor this with a variety of metrics and dashboards, and respond with autoscaling.

One symptom of a growing backlog problem is if the oldest_unacked_message_age and num_undelivered_messages are growing in tandem.

Additionally, we can compare the number of published messages per second vs acknowledged messages per second. If there is a growing gap, we can scale up the subscriber instances or threads. This can be easily visualized with a Google Stackdriver monitoring dashboard.

Published vs Acknowledged messages (via StackDriver dashboard)

If a message cannot be processed within two retry attempts, it is moved to a dead-letter subscription.

We also monitor and identify any problematic SQL queries with Google Cloud SQL Query Insights.

Calculation

The core price fluctuation calculation is idempotent and performed with jOOQ and PostgreSQL. So far this has proved extremely performant with minimal resources, but depending on future throughput and business requirements we may move to an in-memory structure. Identified price fluctuations are then sent for further asynchronous processing, for example delivering the necessary user notifications.
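As an illustration only, a fluctuation query in jOOQ could look something like the sketch below; the table and column names are hypothetical and the real calculation is not shown here.

```java
import java.math.BigDecimal;
import org.jooq.DSLContext;
import org.jooq.Field;
import org.jooq.Record2;
import org.jooq.Result;

import static org.jooq.impl.DSL.abs;
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

class FluctuationRepository {

    private final DSLContext dsl;

    FluctuationRepository(DSLContext dsl) {
        this.dsl = dsl;
    }

    /** Products whose absolute daily fluctuation exceeds the given percentage threshold. */
    Result<Record2<String, BigDecimal>> findBreaches(BigDecimal thresholdPercent) {
        // Hypothetical schema: a "price" table with product_id, opening_price and latest_price.
        Field<BigDecimal> fluctuationPct =
                field("(latest_price - opening_price) / opening_price * 100", BigDecimal.class);

        return dsl.select(field("product_id", String.class), fluctuationPct)
                  .from(table("price"))
                  .where(abs(fluctuationPct).greaterThan(thresholdPercent))
                  .fetch();
    }
}
```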

Deployment and Release Strategy

Dark Launching into production

We decided to deploy the new backend services and infrastructure to production without notifying our users or deploying the user interface. This was done in order to assess the additional load and performance impacts on the system before the feature was officially released.

The new features were guarded by a feature toggle. This meant that if we encountered any performance issues, the features could be turned off before anyone noticed.

Issues identified

Denial of service

We noticed that if we attempted to open many streams in a very short period of time, our market data provider would start to reject requests. We learned from our partner that they were encountering a partial infrastructure DoS. After some collaboration, we agreed to gradually ramp up the number of established streams while they worked on the root-cause fix. This was extremely important, as the issue could have caused an outage for both teams.

Streams

When our partner's server performed a graceful shutdown, many of our streams would disconnect and not be immediately re-established. This was an interesting observation, as we had focused primarily on handling and testing failure scenarios (such as broken or terminated connections and bad responses) and overlooked the simple case of a graceful stream completion.

GCP PubSub Flow Control & Reactor BackPressure

GCP PubSub became temporarily latent. Messages were being consumed from the SSE streams faster than the GCP publisher could publish them, resulting in a large backlog of messages queued for the publishing threads. Shortly afterwards, the service crashed with an OutOfMemoryError.

In order to prevent an out-of-memory error, we combined Spring Cloud PubSub flow control with reactive streams backpressure.

Spring Cloud GCP PubSub flow control example

Publishing to GCP PubSub is asynchronous and returns a CompletableFuture. However, upon reaching the flow-control limit, the calling (requester) thread blocks before the CompletableFuture is returned, and is unblocked when capacity becomes available.

In our application, the requester thread executes within a parallel scheduler. The prefetch value caps how many elements are requested from the upstream publisher at a time. If the publisher produces more elements than can be prefetched, a backpressure signal is emitted to rate-limit it.

Flux prefetch rate

If the subscriber thread is blocked by the flow-control limit, it stops prefetching and the publisher starts to accumulate a backlog of elements. When the backlog grows beyond the prefetch size, a backpressure signal is emitted.

Rate limit the upstream publisher

An upstream publisher can handle a backpressure signal in various ways; buffer, drop, error, or keep the latest element. In our use case, it is acceptable to drop elements until the downstream subscriber starts to request again.

Publisher backpressure — Drop elements
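Putting these pieces together, a minimal Reactor sketch of the arrangement described above (the prefetch size and the publish call are placeholders):

```java
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

class BackpressureExample {

    private static final int PREFETCH = 256; // illustrative prefetch size

    /**
     * Upstream price events are dropped while the downstream (publishing) thread is
     * blocked by the Pub/Sub flow-control limit and its prefetch backlog is full.
     */
    static void run(Flux<String> priceEvents) {
        priceEvents
                .onBackpressureDrop(dropped ->
                        System.out.println("Dropped while downstream is busy: " + dropped))
                .publishOn(Schedulers.parallel(), PREFETCH) // bounded demand from a parallel scheduler
                .subscribe(BackpressureExample::publishToPubSub); // may block on flow control
    }

    private static void publishToPubSub(String event) {
        // placeholder for the real publish call, which can block when the
        // flow-control limit is reached
    }
}
```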

Heap Memory

After profiling the JVM, we noticed that we could make better use of our total memory by increasing the maximum heap size from 25% to 40% of the container's memory.

We use Google Container Tools' Jib, which gives us fast, daemonless, and reproducible application builds. JVM flags can be passed easily with the Jib Maven plugin.

Pass JVM flags with Google Jib

Take-away

Dark launching proved extremely valuable, as it identified a number of key issues that could have caused an outage shortly after the feature's release.

This enabled us to have a smooth release :)

Happy coding

Philip
