Introduction
The thundering herd is a classic problem in distributed systems. Consider this scenario: you are a restaurant owner, and you launch a new dish that becomes very popular with your customers. A large number of them want to try it, and they all arrive at the same time. You have only one chef who can cook the dish, and only one kitchen where it can be cooked. When the customers arrive, they all try to place their orders at once, the chef is overwhelmed, the kitchen becomes crowded, and the food cannot be cooked fast enough to keep up with demand.
This sounds like a nightmare, right? This is what we call the thundering herd problem. In this article, we will discuss what the thundering herd problem is and how you can tackle it effectively when designing a system.
Understanding the thundering herd problem
The thundering herd problem often appears when:
- External events occur: A popular sale, trending news, or a viral tweet can trigger a sudden surge in requests.
- Cache expiry: When a popular cached item expires or is invalidated, many clients miss the cache at once and all rush to fetch the fresh data from the origin server.
- Retries gone wrong: Clients encountering temporary errors might retry requests simultaneously, exacerbating the problem.
- Scheduled tasks: When a scheduled task is due, multiple clients might try to execute it at the same time.
How to tackle the thundering herd problem?
There are several strategies that you can use to tackle the thundering herd problem. Let’s talk about some of them.
Rate limiting
Rate limiting is a technique used to control the rate of incoming requests to a server. By setting a limit on the number of requests that can be made in a given time period, you can prevent the server from being overwhelmed by a sudden surge of requests.
For example, you can set a limit of 100 requests per second. If clients try to exceed that limit, the server rejects the excess requests (typically with an HTTP 429 response), preventing itself from being overwhelmed.
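As a rough sketch, here is a minimal token-bucket rate limiter in Python that enforces the 100-requests-per-second limit from the example; `handle_request` is an illustrative stand-in for a real handler:

```python
import time

class TokenBucket:
    """Allows up to `rate` requests per second, with short bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit; reject the request

limiter = TokenBucket(rate=100, capacity=100)  # ~100 requests per second

def handle_request(request):
    if not limiter.allow():
        return "429 Too Many Requests"
    return "200 OK"
```

A token bucket permits short bursts up to its capacity while still enforcing the average rate, which tends to be friendlier to clients than a hard fixed-window limit.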
Exponential backoff
Exponential backoff is a technique that prevents a large number of clients from retrying a request at the same time. When a client makes a request and receives an error, it waits before retrying; if the retry fails too, it waits roughly twice as long before the next attempt. This continues, with the waits growing exponentially, until the request succeeds or a retry limit is reached. Adding random jitter to each wait spreads the retries out even further, so clients that failed together do not retry together.
For example, you might wait 1 second before the first retry, 2 seconds before the second, 4 seconds before the third, and so on. This spreads the retries out over time instead of having every client hit the server again at the same moment.
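Here is a minimal sketch of exponential backoff with full jitter in Python. `TransientError` is a placeholder for whatever errors your client considers retryable, and the default delays match the 1s/2s/4s example above:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for errors worth retrying (timeouts, 503s, and so on)."""

def call_with_backoff(operation, max_retries=5, base_delay=1.0, max_delay=60.0):
    for attempt in range(max_retries):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error to the caller
            # Exponential growth: 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random duration in [0, delay) so clients
            # that failed at the same moment do not retry at the same moment.
            time.sleep(random.uniform(0, delay))
```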
Circuit Breakers
Circuit breakers are a great way to keep a crowd of clients from overwhelming a server that is already struggling. Think of them as automatic safety switches: when failures exceed a threshold, the breaker trips and subsequent requests fail fast without touching the service, giving it time to recover and preventing cascading failures.
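A minimal circuit breaker might look like the sketch below. The threshold of 5 consecutive failures and the 30-second reset window are illustrative defaults, not canonical values:

```python
import time

class CircuitBreaker:
    """Trips open after `threshold` consecutive failures; retries after `reset_after` seconds."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast without calling the service")
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```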
Smart Caching
If you are serving a resource that can be cached, caching reduces the load on the server: responses are served directly from the cache, so most requests never reach the server at all.
Ideally, we want caching strategies that refresh entries based on usage patterns and handle invalidation efficiently, so that an expiring hot key does not trigger a herd of simultaneous refreshes. A common technique here is request coalescing (sometimes called single-flight): when an entry expires, exactly one request recomputes it while the rest wait for, and then reuse, its result.
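Here is a minimal single-flight cache sketch in Python, using a per-key lock so that only one thread recomputes an expired entry; the class name and TTL are illustrative:

```python
import threading
import time

class SingleFlightCache:
    """Cache where only one caller recomputes an expired entry; others reuse the result."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.entries = {}   # key -> (value, expires_at)
        self.locks = {}     # key -> lock guarding recomputation of that key
        self.guard = threading.Lock()

    def get(self, key, compute):
        entry = self.entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # fresh hit: no backend call
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have refreshed while we waited.
            entry = self.entries.get(key)
            if entry and entry[1] > time.monotonic():
                return entry[0]
            value = compute()                    # exactly one thread hits the backend
            self.entries[key] = (value, time.monotonic() + self.ttl)
            return value
```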
Bulkhead pattern
The bulkhead pattern isolates different parts of a system from each other so that a failure in one part cannot cascade and take down the rest. Each part gets its own dedicated pool of resources, such as threads or connections, which the other parts cannot exhaust.
For example, suppose your service has two endpoints, one for reading data and one for writing data. By giving the read and write paths separate resource pools, an overload in the write path cannot starve the read path, and vice versa.
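As a sketch, here is a simple bulkhead that uses separate semaphores for the read and write paths. `do_read` and `do_write` are stand-ins for real handlers, and the pool sizes are illustrative:

```python
import threading

def do_read(request):   # stand-in for your real read handler
    return "read result"

def do_write(request):  # stand-in for your real write handler
    return "write result"

# Separate, fixed-size capacity pools for reads and writes: a flood of slow
# writes cannot exhaust the read pool, and vice versa.
read_slots = threading.BoundedSemaphore(50)   # illustrative sizes
write_slots = threading.BoundedSemaphore(10)

def handle_read(request):
    if not read_slots.acquire(blocking=False):
        return "503 read pool exhausted"       # shed load instead of queueing forever
    try:
        return do_read(request)
    finally:
        read_slots.release()

def handle_write(request):
    if not write_slots.acquire(blocking=False):
        return "503 write pool exhausted"
    try:
        return do_write(request)
    finally:
        write_slots.release()
```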
Load balancing and Autoscaling
Load balancing is another great technique for distributing load across multiple servers. By spreading requests out, a load balancer prevents any single server from being overwhelmed, and it lets you scale out to handle far more requests than one machine could.
Autoscaling complements this by dynamically adding servers during peak times and removing them afterwards, so your system can absorb the stampede gracefully without paying for peak capacity around the clock.
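As a toy illustration of the idea, here is a client-side round-robin balancer in Python. The backend addresses are made up, and in practice a dedicated reverse proxy or cloud load balancer would do the actual forwarding:

```python
import itertools

class RoundRobinBalancer:
    """Hands out backends in rotation so each receives ~1/N of the traffic."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])

def route(request):
    backend = balancer.pick()
    # Forwarding is elided; a real deployment would proxy the request here.
    return f"forwarding {request} to {backend}"
```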
Conclusion
All in all, the thundering herd problem is a common failure mode in distributed systems, but it is a manageable one. By combining the strategies discussed in this article, such as rate limiting, backoff with jitter, circuit breakers, smart caching, bulkheads, and load balancing with autoscaling, you can prevent your system from being overwhelmed by a sudden surge of requests.