01/30/24

The Thundering Herd Problem

Distributed Systems Horror Stories: Part Two

7 Min Read

Let me take you all the way back to April 2018. I was working at a startup that was about to ship a heavily requested new feature. We'd captured interest in the feature by building a waitlist, but we were being pretty secretive about the launch date (mainly because we didn't know when development would be finished).

After shipping the last features to make it MVP complete, we did some very light testing to ensure all of the user journeys worked as we expected. Things were looking good and after discussing it with our product manager, we made the decision to enable it for all customers. The excitement of shipping our first big feature in a while got the better of the entire company and before we knew it, push notifications had been sent to all customers and the entire waitlist had been enabled.

We did not expect the influx of traffic that we started to receive, and we were soon seeing alerts for services crashing due to an unhandled panic. After fixing the panic, we redeployed and, for a brief moment, all our alerts were resolved and we could relax. The relaxation did not last though, as 5 minutes later a whole different set of alerts started firing (high latency, high CPU usage). It was time to enter incident mode again.

The Thundering Herd Problem

Unfortunately, we had just experienced the thundering herd problem first hand.

A thundering herd incident for an API typically occurs when a large number of clients or services simultaneously send requests to an API after a period of unavailability or delay. This could be caused by either services that you own or third party services retrying requests after a period of downtime or instability.

These requests are often not malicious and are caused by engineers trying to do the right thing; one of the first lessons we get taught when working on distributed systems is that networks are not reliable and we should assume that they are going to fail. This is good advice, but how we handle that failure is important.

Resolving the Immediate Incident

To resolve the immediate incident, we need to either reduce the number of incoming requests or increase the number of requests our API can handle (or preferably both). To do this you can:

  • Scale your API horizontally: Be sure to keep an eye on your database health if you choose this option, especially if you have not scaled your API this far horizontally in the past. If you are not splitting read and write requests or using connection pooling, you might find you end up with a database incident too!

  • Scale your API vertically: This might not always work, but there is typically some relationship between CPU, memory and the number of requests your application can process.

  • Scale down less critical workloads: It might be that there are some customer journeys you can consciously impact in the short term whilst you get yourselves back to a more stable place. For example, you might decide that not allowing new customers to sign up during this period is a good approach. For other companies, this might be the worst thing imaginable, and sign-ups are the critical journey that must continue to work. Be sure to have these discussions at your company if you have not had them already.

Ensuring it Doesn't Happen Again

We scaled our API up, we scaled other services down, and now we are back to operating in business-as-usual mode. The stress has not gone completely though, as we know this could happen again at any time. What can we do to prevent that? We need to tackle it from a couple of angles. Let's step through them, starting with the changes necessary to the API.

Rate-Limiting

Firstly, we should consider adding per-client rate limits. You may do this at the infrastructure layer (many load balancers or reverse proxies support this) or you may choose to do it in your application. Either way, giving clients an allowance of how many requests they can make per minute is a useful way to protect your API. Deciding on per-client rate limits is an imperfect science and will require some experimentation. I recommend that your API returns the Retry-After header so that clients can use it to determine how long to "back off" for. The nice thing about this approach is that you can encourage exponential back-off behavior for clients.
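
To make that concrete, here is a minimal sketch of per-client rate limiting as Go HTTP middleware, using golang.org/x/time/rate. The "X-API-Key" client key, the limit of 5 requests per second with a burst of 10, and the Retry-After value are all illustrative assumptions you would tune for your own API.

```go
// A minimal sketch of per-client rate limiting as Go HTTP middleware.
// The client key, limits, and Retry-After value are illustrative assumptions.
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

type clientLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func newClientLimiters(rps rate.Limit, burst int) *clientLimiters {
	return &clientLimiters{limiters: make(map[string]*rate.Limiter), rps: rps, burst: burst}
}

// limiterFor lazily creates a token-bucket limiter per client key.
func (c *clientLimiters) limiterFor(key string) *rate.Limiter {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.limiters[key]
	if !ok {
		l = rate.NewLimiter(c.rps, c.burst)
		c.limiters[key] = l
	}
	return l
}

// middleware rejects requests over a client's allowance with 429 and a
// Retry-After hint, so well-behaved clients know how long to back off for.
func (c *clientLimiters) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("X-API-Key") // hypothetical client identifier
		if key == "" {
			key = r.RemoteAddr
		}
		if !c.limiterFor(key).Allow() {
			w.Header().Set("Retry-After", "1") // seconds
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	limits := newClientLimiters(5, 10) // 5 req/s with a burst of 10, per client
	mux := http.NewServeMux()
	mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", limits.middleware(mux))
}
```

One thing to note with this in-process approach: each instance of your API keeps its own counters, so if you scale horizontally a client's effective allowance grows with the number of instances. If that matters, counting centrally (for example in Redis) or rate limiting at the load balancer or gateway may be a better fit.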

Caching

There may be an opportunity for you to implement caching in your API to prevent it from having to do expensive DB lookups or calculations. This won't be suitable for all APIs, and there is a saying that "if you solve a problem with caching you now have two problems", which is always worth keeping in mind whenever you reach for it as a solution. AWS has a great article on this called "the ecstasy and agony of caching" which I highly recommend reading.
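
As an illustration, here is a minimal sketch of a small TTL cache sitting in front of an expensive lookup. The loadUserFromDB stub and the 30-second TTL are hypothetical, not details from the incident above.

```go
// A minimal sketch of a TTL cache in front of an expensive lookup.
// The loadUserFromDB stub and the 30-second TTL are illustrative assumptions.
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

type ttlCache struct {
	mu    sync.RWMutex
	items map[string]entry
	ttl   time.Duration
}

func newTTLCache(ttl time.Duration) *ttlCache {
	return &ttlCache{items: make(map[string]entry), ttl: ttl}
}

func (c *ttlCache) get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expiresAt) {
		return "", false
	}
	return e.value, true
}

func (c *ttlCache) set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

// getUser checks the cache first and only falls through to the database on a
// miss or after the entry has expired.
func getUser(cache *ttlCache, id string) (string, error) {
	if v, ok := cache.get(id); ok {
		return v, nil
	}
	v, err := loadUserFromDB(id) // hypothetical expensive DB call
	if err != nil {
		return "", err
	}
	cache.set(id, v)
	return v, nil
}

func loadUserFromDB(id string) (string, error) {
	return "user-" + id, nil // stand-in for a real database query
}

func main() {
	cache := newTTLCache(30 * time.Second)
	u, _ := getUser(cache, "42") // first call hits the "database"
	fmt.Println(u)
	u, _ = getUser(cache, "42") // second call is served from the cache
	fmt.Println(u)
}
```

Note that this naive version lets concurrent misses for the same key all hit the database at once, a miniature thundering herd of its own; a package like golang.org/x/sync/singleflight can collapse those concurrent lookups into a single one.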

Circuit-Breaking

You could consider adding a circuit breaker to your API so that once you start to see a surge in requests, you open the circuit and prevent more requests from being handled. For Go specifically, there are some open-source libraries you can use here, such as sony/gobreaker or the one provided by go-kit.
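
Here is a minimal sketch using sony/gobreaker, one of the libraries mentioned above. The breaker settings (trip once at least 10 requests have been seen and half of them have failed, then stay open for 30 seconds) and the expensiveDownstreamCall stub are illustrative assumptions you would tune for your own service.

```go
// A minimal sketch of circuit breaking with sony/gobreaker.
// The settings and the downstream call are illustrative assumptions.
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/sony/gobreaker"
)

var cb = gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:    "reports",
	Timeout: 30 * time.Second, // how long the circuit stays open before probing again
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
		return counts.Requests >= 10 && failureRatio >= 0.5
	},
})

// fetchReport wraps an expensive call in the breaker. While the circuit is
// open, Execute fails fast with gobreaker.ErrOpenState instead of piling more
// work onto a struggling dependency.
func fetchReport(id string) (string, error) {
	result, err := cb.Execute(func() (interface{}, error) {
		return expensiveDownstreamCall(id) // hypothetical call being protected
	})
	if errors.Is(err, gobreaker.ErrOpenState) {
		return "", errors.New("temporarily unavailable, please retry later")
	}
	if err != nil {
		return "", err
	}
	return result.(string), nil
}

func expensiveDownstreamCall(id string) (string, error) {
	return "report-" + id, nil // stand-in for a slow database query or RPC
}

func main() {
	report, err := fetchReport("42")
	fmt.Println(report, err)
}
```

Pair this with a sensible response to the caller (such as a 503 with a Retry-After header) so clients know to back off rather than hammering the endpoint while the circuit is open.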

Alerts

In the incident we described above, we actually did have alerts set up, which allowed us to catch the issue, but the damage was already done. Earlier visibility into the problem could have prompted us to scale vertically or horizontally before it became an issue. Even better, we could use those early warning signs as a signal to begin scaling automatically and resolve an incident before it even happens!

Client Changes

For the clients calling your API, you will need to work with them to ensure they respond to your rate limits correctly. Furthermore, in the event that they receive a failure response (something in the 5xx range), you may want to ask them to implement exponential backoff with a little bit of jitter, to ensure that retried requests are "spread out" in the event of downtime. Sam Rose has made an excellent visualization of retry patterns which should help you see the impact that different strategies might have on your API.
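
As a sketch of what that might look like on the client side, here is a small Go helper that retries failed requests with exponential backoff and full jitter. The URL, base delay, and attempt count are illustrative assumptions; a real client should also honor the Retry-After header when the server provides one.

```go
// A minimal sketch of client-side retries with exponential backoff and full
// jitter. The URL, base delay, and attempt count are illustrative assumptions.
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

func getWithRetry(url string, maxAttempts int) (*http.Response, error) {
	base := 200 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		resp, err := http.Get(url)
		// Treat anything other than a 5xx or 429 as a final answer.
		if err == nil && resp.StatusCode < 500 && resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		if resp != nil {
			resp.Body.Close()
		}
		// Full jitter: sleep a random duration between 0 and base * 2^attempt,
		// so retries from many clients spread out rather than arriving together.
		maxDelay := base << attempt
		time.Sleep(time.Duration(rand.Int63n(int64(maxDelay))))
	}
	return nil, fmt.Errorf("giving up on %s after %d attempts", url, maxAttempts)
}

func main() {
	resp, err := getWithRetry("https://api.example.com/orders", 5)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```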

As with the API, caching could be something to consider here. Does the client need the freshest information? Could storing some of the API responses locally reduce load on the API and improve the customer experience?

Infrastructure Changes

You could take some of the lessons learned from this incident and make changes to your infrastructure to protect other services from suffering the same fate. You could consider implementing rate limits and load shedding on your API Gateway; I particularly like this blog post from Stripe, which talks about how they handle this. As mentioned earlier, rate-limiting can live at this higher level rather than in each individual service. Circuit-breaking is another pattern to consider applying at the infrastructure layer too.

Wrapping Up

The thundering herd problem is a tough one to deal with if you do not have any strategies in place to handle it at the time of the event. As with all distributed systems issues, foresight is your friend. If you have not had discussions about which customer journeys are the most important, you should have them now.

Furthermore, ensuring your API has the ability to impose rate limits and that clients are using some form of exponential back-off will set you up for success if you do ever find yourself in a position where your API is overwhelmed.

I hope you found this blog post useful and it helps you tame your herd!

About The Author

Matthew Boyle is an experienced technical leader in the field of distributed systems, specializing in using Go.

He has worked at huge companies such as Cloudflare and General Electric, as well as exciting high-growth startups such as Curve and Crowdcube.

Matt has been writing Go for production since 2018 and often shares blog posts and fun trivia about Go over on Twitter.

He's currently working on a course to help Go Engineers become masters at debugging. You can find more details of that here.

Encore

This blog is presented by Encore, the Development Platform for startups building event-driven and distributed systems.
