Buffering services with the help of a queue is very common. In Microsoft's Cloud Design Patterns, it's called the Queue-Based Load Leveling pattern; Yan Cui calls it Decoupled Invocation, and Jeremy Daly calls it the Scalable Webhook.
A queue decouples tasks from services, creating a buffer that holds requests for less scalable backends or third-party services.
Regardless of the volume of incoming requests, the processing load is driven by the consumers: a low concurrency and a small batch size keep the workload under control.
“Load can be unpredictable, and some services cannot scale when there's an intermittent heavy load. Peaks in demand can cause overload, flooding downstream services and causing them to fail.
Introducing a queue between the services to act as a buffer can alleviate these issues, storing messages and allowing consumers to process the load at their own pace.”
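A minimal consumer sketch in Python with boto3 makes that inversion of control visible: the consumer, not the producers, decides how much work to pull per iteration. The queue URL and the `process` function here are hypothetical placeholders.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/buffer-queue"  # hypothetical

def process(message_body: str) -> None:
    ...  # hand the payload to the slow backend

while True:
    # The consumer controls the pace: at most 5 messages per poll,
    # no matter how fast producers are writing to the queue.
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=5,  # batch size caps the work per iteration
        WaitTimeSeconds=20,     # long polling to avoid busy-waiting
    )
    for message in response.get("Messages", []):
        process(message["Body"])
        # Delete only after successful processing; otherwise the message
        # reappears after the visibility timeout and is retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```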
In the example, we have a Migration worker process that reads the content of an Elasticsearch index. Elasticsearch is fast: the worker can read thousands of Articles and fetch their dependencies (Authors, Categories, Tags, Shows, Assets, etc.) in less than a second.
On the right side, we have a service that needs to ingest all the content, but before we can create an Article we have to create all its relationships in a specific order, checking whether each one already exists or needs to be updated, which is slower. Even if we scale the service horizontally (and we did), the relational database behind it becomes the bottleneck.
After a point (~100k to 500k Articles), querying the database slows to a crawl because of lock contention on the Has-and-Belongs-to-Many relationship tables.
By limiting the batch size and the number of workers running concurrently, we can maintain a slow but steady flow that reduces lock contention in the database.
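One way to enforce that flow, sketched below under the assumption that the Articles in a batch can be ingested independently (the function names and limits are made up for illustration): a fixed-size worker pool, so no more than a handful of writers ever touch the join tables at once.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10   # Articles pulled per iteration
MAX_WORKERS = 4   # concurrent writers against the relational database

def ingest_article(article: dict) -> None:
    """Create the Article's relationships in order, then the Article itself."""
    ...  # slow path: existence checks, updates, inserts

def drain(fetch_batch):
    # A fixed-size pool caps how many Articles touch the
    # Has-and-Belongs-to-Many tables at the same time,
    # trading raw speed for low lock contention.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        while batch := fetch_batch(BATCH_SIZE):
            # Block until the whole batch is done before pulling more.
            list(pool.map(ingest_article, batch))
```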
Another common example is using an SQS queue to buffer API requests and amortize spikes in traffic, like in the diagram above.
The endpoint returns 202 Accepted to the client, along with a transaction ID and a location for the result. On the client side, the UI can give feedback to the user by emulating the expected behavior.
The service can process the requests in the background at its own pace. Even if long-running processes are involved, an increase in load on the client side will never affect the throughput and responsiveness of the system.
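A minimal sketch of such an endpoint, assuming an AWS Lambda handler behind API Gateway (the queue URL and the `/results/...` location are hypothetical):

```python
import json
import uuid

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/requests"  # hypothetical

def handler(event, context):
    transaction_id = str(uuid.uuid4())
    # Buffer the request; a consumer will process it at its own pace.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"id": transaction_id, "payload": event["body"]}),
    )
    # 202 Accepted: the work is queued, not done. The Location header
    # tells the client where to poll for the result.
    return {
        "statusCode": 202,
        "headers": {"Location": f"/results/{transaction_id}"},
        "body": json.dumps({"transactionId": transaction_id}),
    }
```

Returning 202 instead of 200 makes the contract explicit: the request has been accepted for processing, not completed.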
This pattern is not helpful when services require a synchronous response (waiting for a reply). It's also important to note that skipping concurrency limits can diminish the effectiveness of the pattern: AWS Lambda and Kinesis can scale quickly, overwhelming downstream services that are less elastic or slower to scale. Zalando's API Guidelines include a full section about Events that covers some important considerations for this pattern.
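If the consumer is, say, a Lambda function fed by an SQS queue, those scaling caps can be set explicitly. A sketch with boto3 (the function name, queue ARN, and limits are assumptions for illustration):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the total concurrent executions the consumer can ever use...
lambda_client.put_function_concurrency(
    FunctionName="ingest-consumer",  # hypothetical
    ReservedConcurrentExecutions=10,
)

# ...and how aggressively the SQS event source scales it.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:requests",  # hypothetical
    FunctionName="ingest-consumer",
    BatchSize=10,                             # messages per invocation
    ScalingConfig={"MaximumConcurrency": 5},  # SQS-specific scaling cap
)
```

With both limits in place, the queue absorbs the spikes while the consumer drains it no faster than the downstream services can handle.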