Hi Friends,
Welcome to the 3rd issue of the Polymathic Engineer newsletter. We’re more than 600 subscribers this week, which is already far more than I expected. Thank you so much for your trust and support.
This week we will cover the following topics:
what is caching
caching in distributed systems
backend caching strategies
caching eviction policies
the solution to last week’s coding challenge
an interesting tweet I read this week
Caching
A cache is a hardware or software component that enables faster access to data. The main idea of caching is to store data in a location different from the original one, so that it’s quicker to access.
You can imagine a cache like a fast short-term memory with limited space. If the data you’re looking for is in the cache, you have a cache hit, and you read the data from there. Otherwise, you have a cache miss and read the data from the original location.
Caching in Distributed Systems
Caching occurs at many different levels of a distributed system: hardware, operating system, front-end components, web applications, databases, and so on. Its crucial role is to reduce latency by:
saving network requests
storing the result of a computationally expensive operation executed on a server
avoiding repeated operations (e.g., same database query performed by multiple servers)
Most distributed systems use a combination of two primary approaches to caching: application caching and database caching.
Application caching requires a direct integration into the application code itself. The application code usually checks if a value is in the cache, retrieving it from the database if not. Since accesses to RAM are orders of magnitude faster than disk, the most performant application-level caches store all data in memory. Memcached and Redis are good examples, even if Redis can also be configured to persist some data on disk.
Database caching can generate a significant performance improvement without requiring any change in the code. This is because every database provides some degree of caching that can be tweaked to meet the system’s access patterns.
In general, it’s necessary to remember that, if implemented incorrectly, caching can increase client latency and put an additional burden on the backend. For example, if the cache miss rate is too high, the caching tier adds more latency than it removes, and the system’s performance would improve by removing it.
Another disadvantage of using a cache tier is dealing with stale data. Unless the cached data is static, it’s necessary to have a strategy for invalidating stale entries and sending as little outdated information as possible to clients.
Caching Strategies
There are several ways of implementing caching in an application backend. Here are the most popular.
1. Cache aside
The application reads the data directly from the cache. If the data is in the cache, it's returned to the client. Otherwise, the application retrieves the data from the database and then writes it to the cache (a minimal sketch of this flow follows the lists below). The advantages of this strategy are:
good for read-heavy workloads
Database and cache are decoupled and can handle different data models.
Data are stored in the cache only when necessary (lazy loading), preventing the cache from being loaded with unnecessary data.
The disadvantages are:
stale data may be served if a proper expiration time is not set
in the beginning, the cache is empty, so there are many cache misses
the application code needs to take care of everything
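To make the flow concrete, here is a minimal cache-aside sketch in Python. It assumes a local Redis instance reachable through the redis-py client; fetch_user_from_db is a hypothetical stand-in for a real database query.

    import json
    import redis  # redis-py client; assumes a Redis server on localhost:6379

    cache = redis.Redis(host="localhost", port=6379)

    def fetch_user_from_db(user_id):
        # Hypothetical stand-in for a real database query.
        return {"id": user_id, "name": "Ada"}

    def get_user(user_id, ttl_seconds=300):
        key = f"user:{user_id}"
        cached = cache.get(key)
        if cached is not None:                           # cache hit
            return json.loads(cached)
        user = fetch_user_from_db(user_id)               # cache miss: go to the database
        cache.setex(key, ttl_seconds, json.dumps(user))  # lazy loading; the TTL limits staleness
        return user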
2. Read through
It’s similar to the cache-aside strategy, but the application interacts only with the cache. In case of a miss, the cache itself reads the data from the database, stores it, and returns it to the application.
This approach simplifies the application code but complicates the cache implementation since a custom plugin for fetching the data is often required.
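To give an idea, here is what a toy read-through cache could look like in Python. The loader argument plays the role of the custom plugin that fetches data from the database.

    class ReadThroughCache:
        """Toy read-through cache: the cache itself loads missing data."""

        def __init__(self, loader):
            self._store = {}
            self._loader = loader  # plugin that fetches data from the database

        def get(self, key):
            if key not in self._store:  # miss: the cache, not the application, loads the value
                self._store[key] = self._loader(key)
            return self._store[key]

    # The application only talks to the cache; the lambda stands in for a DB query.
    users = ReadThroughCache(loader=lambda key: {"id": key})
    print(users.get(42))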
3. Write through
The application writes data to the cache, and the cache immediately writes it to the database. This approach ensures consistency between the cache and the database, with no data loss if the cache crashes. The downside is higher latency for write operations.
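A minimal sketch of the idea, using plain dictionaries as stand-ins for the cache and the database:

    class WriteThroughCache:
        """Toy write-through cache: every write goes to the cache and, synchronously, to the database."""

        def __init__(self, database):
            self._store = {}
            self._database = database

        def put(self, key, value):
            self._store[key] = value     # update the cache...
            self._database[key] = value  # ...and the database in the same operation

    database = {}
    cache = WriteThroughCache(database)
    cache.put("user:1", {"name": "Ada"})
    assert database["user:1"] == {"name": "Ada"}  # cache and database stay consistent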
4. Write back
The application writes data to the cache, and the cache asynchronously writes it to the database. This is a good strategy for write-heavy workloads since it ensures lower latency than write through. It also reduces the load on the database and is more tolerant of database failures. The downside is that data can get lost if the cache crashes before flushing.
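Here is a toy write-back sketch, again with a dictionary standing in for the database. Writes only touch the cache; a background thread flushes dirty entries asynchronously, which is also where data would be lost if the process crashed before a flush.

    import threading
    import time

    class WriteBackCache:
        """Toy write-back cache: writes hit the cache; a background thread flushes to the database."""

        def __init__(self, database, flush_interval=1.0):
            self._store, self._dirty = {}, set()
            self._database = database
            self._lock = threading.Lock()
            worker = threading.Thread(target=self._flush_loop, args=(flush_interval,), daemon=True)
            worker.start()

        def put(self, key, value):
            with self._lock:
                self._store[key] = value  # fast: only the cache is touched
                self._dirty.add(key)      # remember what still has to reach the database

        def _flush_loop(self, interval):
            while True:
                time.sleep(interval)
                with self._lock:
                    for key in self._dirty:  # asynchronously persist dirty entries
                        self._database[key] = self._store[key]
                    self._dirty.clear()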
5. Write around
The application writes the data directly to the database. Only the data that is read gets into the cache. This is a good strategy for data written once and read infrequently, since the cache stores only data that is actually re-read. However, there is higher latency when reading recently written data, because such reads always result in a cache miss.
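A minimal write-around sketch, with dictionaries standing in for the cache and the database:

    database, cache = {}, {}

    def write(key, value):
        database[key] = value  # writes bypass the cache entirely
        cache.pop(key, None)   # drop any stale copy; the value enters the cache only on read

    def read(key):
        if key in cache:  # hit only for data that has been read before
            return cache[key]
        value = database[key]  # first read after a write is always a miss
        cache[key] = value
        return value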
How to choose a cache strategy
There is no best caching strategy. It all depends on the data and their access patterns.
Cache aside is usually an excellent general-purpose strategy for read-intensive applications. Memcached and Redis are widely used caches implementing this strategy.
In the case of write-heavy workloads, a write-back approach is a good choice instead. DynamoDB Accelerator (DAX) is an example of a cache combining read through and write through.
Write around could be a good fit when data is written once and rarely read (e.g., real-time logs).
Eviction Policies
The quantity of storage available in the cache tier is typically much smaller than that of the database. As a result, the cache will sooner or later fill up, making it impossible to add new data.
This is why a cache eviction policy is necessary. In general, an optimal eviction policy replaces data that is rarely read with data that is read frequently. In practice, many different strategies can be adopted. Here are the most common ones (a minimal LRU sketch follows the list):
First In First Out (FIFO): remove the block that was added to the cache first
Last In First Out (LIFO): remove the block that was added to the cache most recently
Least Recently Used (LRU): discard the least recently used items first
Most Recently Used (MRU): evict the most recently used items first
Least Frequently Used (LFU): keep track of how often an item is needed, discarding the items that are used least often first
Random Replacement (RR): randomly select a candidate item and discard it to make space when necessary
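As an example, here is a minimal LRU cache built on Python's OrderedDict. Production caches expose these policies as configuration; Redis, for instance, offers LRU and LFU variants through its maxmemory-policy setting.

    from collections import OrderedDict

    class LRUCache:
        """Minimal LRU cache: evicts the least recently used entry once capacity is exceeded."""

        def __init__(self, capacity):
            self._capacity = capacity
            self._store = OrderedDict()

        def get(self, key):
            if key not in self._store:
                return None
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]

        def put(self, key, value):
            if key in self._store:
                self._store.move_to_end(key)
            self._store[key] = value
            if len(self._store) > self._capacity:
                self._store.popitem(last=False)  # evict the least recently used entry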
Coding challenge
The coding challenge of last week was Subarray Product Less Than K.
Given an array of positive integers and an integer k, the problem asks for the number of contiguous subarrays where the product of all the elements is strictly less than k.
The brute force solution is to consider every possible subarray, check if the product of its elements is strictly less than k, and, if so, count it in the answer. This is inefficient because those subarrays overlap, so a lot of repeated work is done: the same multiplications are executed over and over again.
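As a reference, a direct Python translation of the brute force could look like this sketch:

    from math import prod

    def count_subarrays_brute_force(nums, k):
        """Check every contiguous subarray; products are recomputed over and over."""
        n, count = len(nums), 0
        for start in range(n):
            for end in range(start, n):
                if prod(nums[start:end + 1]) < k:  # repeated work across overlapping subarrays
                    count += 1
        return count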
The sliding window approach is a smarter solution. The idea is to track the product of the elements currently inside the window and ensure that this product is always strictly less than k. The right pointer expands the window at each step, while the left pointer shrinks the window when the product of its elements becomes greater than or equal to k.
Since the integers are positive, all the subarrays ending at the right pointer and contained in the window are valid and must be included in the answer. As we discussed last week, the number of those subarrays corresponds to the window’s length.
Here is the Python code with the solution:
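    def num_subarrays_product_less_than_k(nums, k):
        if k <= 1:
            return 0  # products of positive integers are >= 1, so no subarray qualifies
        count, product, left = 0, 1, 0
        for right, value in enumerate(nums):
            product *= value     # expand the window to the right
            while product >= k:  # shrink from the left until the product is valid again
                product //= nums[left]
                left += 1
            count += right - left + 1  # all valid subarrays ending at `right`: the window's length
        return count

    # Example: nums = [10, 5, 2, 6], k = 100 -> 8
    print(num_subarrays_product_less_than_k([10, 5, 2, 6], 100))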
For the next week, I challenge you to solve the problem Find Players With Zero or One Losses. I’ll provide the solution next week.
An interesting tweet of this week


We live in a data-driven society, where firms that use data to make better decisions and respond to changing demands thrive. In case you didn’t know, there are two types of data processing systems: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). This thread highlights their differences, explaining their connection with row-oriented and column-oriented databases. This is an exciting topic, and I’ll write something about it soon.