Caching in Distributed Systems - Part III
Challenges of running caching in a production environment: cache avalanche and penetration, traffic patterns, large keys, cold starts, and consistency.
Hi Friends,
Welcome to the 145th issue of the Polymathic Engineer newsletter.
This week, we complete our deep dive on caching in distributed systems. When you use caching in production, you run into problems that rarely show up during development. Strange issues appear out of nowhere because of real traffic patterns, system failures, and sheer scale.
However, most of these problems are well known, and there are tried-and-true ways to solve them. Understanding these operational challenges before they cause production issues can save you from some stressful situations. In this article, we will examine the most common issues and provide guidance on resolving them.
The outline will be as follows:
Cache Avalanche and Thundering Herd
Cache Penetration
Traffic Pattern Problems
Large Key Problem
Cold Start Problem
Consistency Challenges
Project-based learning is the best way to develop solid technical skills. CodeCrafters is an excellent platform for practicing exciting projects, such as building your version of Redis, Kafka, a DNS server, SQLite, or Git from scratch.
Sign up, and become a better software engineer.
Cache Avalanche and Thundering Herd
A cache avalanche occurs when many cache records expire simultaneously. This sends a huge number of requests to the backing storage all at once, unlike normal cache misses, which occur gradually and are easier to absorb.
Let’s suppose you have a lot of cache records that all expire at midnight. Everything that was previously served from the cache must now be retrieved from the database on every request. A database that previously handled only 10% of the traffic now has to deal with all of it. The result can be a complete system failure.
The solution to cache avalanche is staggered expiration. Instead of setting a TTL of exactly 3600 seconds, you set it to 3600 plus a random offset. This spreads out expiration times, so cache entries don’t all expire at once.
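To make this concrete, here is a minimal sketch of staggered expiration (the cache client, key names, and the 5-minute jitter window are illustrative assumptions, not prescriptions):

```python
import random

BASE_TTL_SECONDS = 3600
MAX_JITTER_SECONDS = 300  # spread expirations over an extra 5-minute window

def ttl_with_jitter() -> int:
    # Each entry gets a slightly different lifetime, so keys written at the
    # same moment do not all expire at the same moment.
    return BASE_TTL_SECONDS + random.randint(0, MAX_JITTER_SECONDS)

def cache_set(cache, key: str, value: str) -> None:
    # `cache` stands for any client exposing set(key, value, ttl_seconds),
    # e.g. a thin wrapper around Redis SETEX.
    cache.set(key, value, ttl_with_jitter())
```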
A thundering herd is a related problem that occurs when multiple clients attempt to rebuild the same cache record simultaneously. Let’s say that a popular piece of data is removed from the cache, and 100 requests at the same time notice that it’s missing.
Without coordination, all 100 requests will try to fetch the data from the database and update the cache, which means 99 unnecessary database calls. The solution is request coalescing: when multiple requests need the same data, only the first one actually fetches it from the database. The others wait for that first request to complete and then reuse its result. Many cache libraries provide this functionality built in.
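As an illustration, here is a minimal single-process sketch of request coalescing (the Coalescer class and the fetch callback are hypothetical names; libraries such as Go’s singleflight package implement the same idea):

```python
import threading

class Coalescer:
    """Lets only one caller per key hit the database; the rest reuse its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> (event, result box)

    def get(self, key, fetch):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        event, box = entry
        if leader:
            try:
                box["value"] = fetch(key)  # the only database call for this key
            finally:
                with self._lock:
                    del self._in_flight[key]
                event.set()
        else:
            event.wait()  # followers block until the leader finishes
        return box.get("value")  # error handling omitted for brevity
```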
Cache Penetration
Cache penetration happens when clients repeatedly request data that doesn’t exist in either the cache or the database. Because the data doesn’t exist, nothing ever gets cached, and every request goes straight to the database. This may not seem dangerous, but at scale it can overwhelm the database.
Attackers can exploit this by requesting data they know doesn’t exist, flooding your database with useless queries. The simplest defense is caching negative results. If a database query returns nothing, store that fact for a short time. The next request for the same missing data is then answered from the cache instead of hitting the database again.
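A rough sketch of negative caching (the cache and db clients, the find_user method, and the sentinel value are all assumptions for illustration):

```python
NEGATIVE_TTL_SECONDS = 60   # keep "does not exist" markers short-lived
MISSING = "__MISSING__"     # sentinel stored for keys known to be absent

def get_user(cache, db, user_id: str):
    cached = cache.get(user_id)
    if cached == MISSING:
        return None                       # known-missing: skip the database
    if cached is not None:
        return cached

    row = db.find_user(user_id)           # hypothetical database lookup
    if row is None:
        # Remember the absence so repeated lookups stop reaching the database.
        cache.set(user_id, MISSING, NEGATIVE_TTL_SECONDS)
        return None

    cache.set(user_id, row, 3600)
    return row
```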
You could also use a Bloom filter, a space-efficient probabilistic data structure that can tell you with certainty when a piece of data does not exist. If the filter says the data isn’t there, you can return immediately without hitting the database.
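For illustration, here is a tiny hand-rolled Bloom filter (the sizes and hash counts are arbitrary; in practice you would size them from the expected number of keys and the acceptable false-positive rate):

```python
import hashlib

class BloomFilter:
    """Answers 'definitely not present' or 'possibly present' for a set of keys."""

    def __init__(self, num_bits: int = 1_000_000, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: str):
        # Derive several bit positions per key from salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

# Populate the filter with every existing key, then guard lookups with it:
#   if not bloom.might_contain(user_id):
#       return None  # definitely absent, no need to touch the database
```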
Traffic Pattern Problems
The hot key problem occurs when a small number of cache keys receive an excessively high amount of traffic. In a distributed cache, this means that some nodes become too busy while others remain mostly idle.
This often happens during significant events. The typical example is when a celebrity posts something controversial on a social media app. In just a few minutes, that one post could get a million views, all of which would hit the same cache key on the same cache node.
The traditional solution is to split hot keys across multiple cache nodes. Instead of storing the data under a single key, you store it under multiple keys, such as hot_key1, hot_key2, and so on. When clients need the data, they pick one of the suffixed keys at random, which spreads the reads across nodes.
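A minimal sketch of the idea (the replica count, key format, and cache client are illustrative assumptions):

```python
import random

NUM_REPLICAS = 8  # tune to how hot the key actually is

def replica_keys(base_key: str):
    return [f"{base_key}:{i}" for i in range(NUM_REPLICAS)]

def write_hot_key(cache, base_key: str, value: str, ttl: int = 300) -> None:
    # Write the same value under every suffixed key so any replica can serve it.
    for key in replica_keys(base_key):
        cache.set(key, value, ttl)

def read_hot_key(cache, base_key: str):
    # Each reader picks a random replica, spreading traffic across cache nodes.
    return cache.get(random.choice(replica_keys(base_key)))
```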
Before the traffic hits, you can also prepare for expected hot keys by preloading data and distributing it across your cache infrastructure.
Large Key Problem
Large keys hold more data than your cache is designed to handle. When a single cache entry is several megabytes, it can cause timeouts, inefficient memory usage, and slow network transfers. The impact goes beyond the key itself: frequent access to large keys can consume significant network bandwidth, which slows down everything else. Reloading a large key from the database after it expires or is evicted also takes a long time, and other requests may fail while that happens.
Several approaches can help. You can compress large values before storing them in the cache, which reduces both memory usage and network transfer time. You can also split large objects into smaller pieces and store them under separate keys.
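As a sketch, compressing big values before they go into the cache might look like this (the size threshold, marker bytes, and cache client are assumptions for illustration):

```python
import json
import zlib

COMPRESSION_THRESHOLD_BYTES = 64 * 1024  # only compress values above ~64 KB

def cache_set_compressed(cache, key: str, obj, ttl: int = 3600) -> None:
    raw = json.dumps(obj).encode()
    if len(raw) > COMPRESSION_THRESHOLD_BYTES:
        # Prefix with a marker byte so readers know the payload is compressed.
        cache.set(key, b"Z" + zlib.compress(raw), ttl)
    else:
        cache.set(key, b"R" + raw, ttl)

def cache_get_compressed(cache, key: str):
    payload = cache.get(key)
    if payload is None:
        return None
    marker, body = payload[:1], payload[1:]
    raw = zlib.decompress(body) if marker == b"Z" else body
    return json.loads(raw)
```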
However, the best solution is always to prevent large keys from existing in the first place. For example, instead of caching entire documents, you might cache only the most frequently accessed parts.
Cold Start Problem
The cold start problem happens when you need to restart your cache infrastructure and suddenly have zero cached data. Every request becomes a cache miss, resulting in all traffic being sent directly to your database.
This is risky because your database is typically sized to handle only a small portion of the total traffic. During a cold start, however, every request goes straight to the database at the same time, which can cause it to crash.
The solution is cache warming, which involves retrieving important data ahead of time before serving live traffic. You can replay recent access logs at a controlled rate to gradually fill your cache before allowing everyone to use it.
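A rough sketch of warming from recent access logs, with a rate cap so the warming itself doesn’t overload the database (the rate, the key_log source, and the client methods are illustrative assumptions):

```python
import time

REQUESTS_PER_SECOND = 200  # cap the warming rate to protect the database

def warm_cache(cache, db, key_log) -> None:
    """Replay recently accessed keys at a controlled rate to prefill the cache."""
    interval = 1.0 / REQUESTS_PER_SECOND
    for key in key_log:                  # e.g. keys parsed from recent access logs
        if cache.get(key) is None:
            value = db.fetch(key)        # hypothetical database read
            if value is not None:
                cache.set(key, value, 3600)
        time.sleep(interval)             # throttle so warming doesn't hammer the DB
```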
For planned maintenance, you can copy data from another cache server to warm up the new one before switching traffic over. For unexpected outages, you may need gradual traffic ramping, which means sending more and more traffic to the cache as it warms up.
Consistency Challenges
Keeping your cache and database in sync when things go wrong is one of the toughest challenges. In an ideal world, every change to the database would immediately be reflected in the cache. In practice, the two can drift apart because of network problems, process crashes, and race conditions. The safest approach is to design your system to tolerate some inconsistency. If your cache gets out of sync, the TTL acts as a safety net: once the stale entries expire, the cache fixes itself.
For write-heavy workloads, think carefully about the order of operations. It is usually safer to write to the database first and then invalidate the cache entry. If something goes wrong between the two steps, you’re more likely to serve slightly stale data than data that is plainly wrong.
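A minimal sketch of that ordering (the db and cache clients and their method names are assumptions):

```python
def update_user(db, cache, user_id: str, new_profile) -> None:
    # 1. Make the database the source of truth first.
    db.save_user(user_id, new_profile)    # hypothetical database write
    # 2. Then drop the cached copy; the next read repopulates it from the database.
    #    If this step fails, the stale entry still expires thanks to its TTL.
    cache.delete(user_id)
```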
If you have multiple cache instances, keeping them consistent gets even harder. Different instances may serve different versions of the same data, which can confuse users.
For read-heavy workloads, eventual consistency is often acceptable. Use TTLs to make sure that all instances converge to the same state eventually, even if they differ for a while. For write-heavy workloads or critical data, you may need to designate a primary cache instance that handles all writes and replicates the data to the other instances asynchronously.
The key is understanding which data needs strict consistency and which can handle being a little out of date. Not everything needs to be perfectly synchronized. Focus your consistency efforts on the data that matters most to your users.
Interesting Articles
Some interesting articles I read in the past weeks: