Consistent hashing
An introduction to hashing in distributed systems, and why consistent hashing is a good approach.
Hi Friends,
Welcome to the 5th issue of the Polymathic Engineer newsletter.
This week we will cover the following topics:
Hashing in distributed systems
Caching servers and hashing
Data partitioning and hashing
The hashing problem
Naïve approach
Consistent hashing
Virtual nodes
Coding challenge
Technical threads of the week
An interesting tweet I read this week
Hashing in distributed systems
In software programming, hashing maps an arbitrary piece of data into a fixed-size value. The same process is also used extensively at a higher level of abstraction when building distributed systems.
There are two typical scenarios where hashing is helpful. The first is mapping web client requests to cache servers, and the second is mapping data to the database servers across which it is partitioned.
Caching servers
Many systems use fast in-memory caches on each server to store the responses to computationally expensive client requests. If the response to a request is already in the cache, the server can return it immediately, reducing the overall latency.
This caching strategy is effective only if requests of the same kind are routed to the same server. Load balancers often use hashing to guarantee this property.
They hash some attribute of the incoming request, such as the client's IP address, the username, or the HTTP request itself, and route the request according to the hash value.
This works well as long as cache servers are not dynamically added or removed.
Data partitioning
To partition data across multiple database servers, each piece of data must be assigned to a specific server. Hashing helps here too: the data key is hashed, and the hash value determines the database server where the data is stored.
This is a good approach as long as database servers are not dynamically added or removed. Moving data between servers is expensive, so a strategy that minimizes data movement is necessary.
The hashing problem
The problem of using hashing in distributed systems can be easily generalized. There are three entities:
Keys: unique identifiers for data or workload requests
Values: data or workload requests that consume resources
Servers: entities that manage data or workload requests
The goal is to map keys to servers while preserving three properties:
Balancing: keys should be distributed evenly across servers
Scalability: it should be possible to add or remove servers with low computational effort
Lookup Speed: given a key, it should be possible to efficiently identify the correct server.
Naïve approach
The most straightforward strategy for mapping data keys to servers using hashing is to
number the servers from 0 to N-1;
calculate the hash value of the key modulo N to find a server number;
assign the key to the server identified by that number.
This scheme always assigns a key to the same server, but it has the drawback of not being horizontally scalable.
If a server is removed but N is not updated, the keys that hash to its slot will keep being routed to the removed server. Similarly, a newly added server will never receive any keys.
If N is updated instead, almost all the existing mappings break, so most of the data has to be remapped and migrated to different servers.
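To make this concrete, here is a minimal sketch of the naïve scheme in Python (the key names and the choice of MD5 are illustrative): routing the same keys with N = 4 and then N = 5 shows that almost every key changes server.

```python
import hashlib

def naive_lookup(key: str, num_servers: int) -> int:
    # Hash the key and take the result modulo the number of servers.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

keys = [f"user-{i}" for i in range(10)]

# Going from 4 to 5 servers moves almost every key to a different server.
for key in keys:
    print(key, naive_lookup(key, 4), naive_lookup(key, 5))
```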
Consistent hashing
Consistent hashing mitigates the drawbacks of the naïve approach. The key idea is to treat the space of possible hash values as circular instead of linear.
Both servers and keys are hashed and placed on a circle. Each key is then assigned to the first server encountered moving clockwise. This strategy requires re-mapping only a fraction of the keys when a server S is added or removed.
For example, if S is added, only the keys between the previous server on the circle and S are re-mapped to S. If S is removed, only the keys assigned to S are re-mapped to the next server on the circle.
Assuming a uniform distribution of the keys on the circle, the average number of re-mapped keys is k/n, where k is the number of keys and n is the number of servers. This guarantees the scalability property but not the balancing one, since the keys can end up non-uniformly distributed among the servers.
Even if the servers were initially evenly distributed in the circle, the distribution would become unbalanced after adding and removing some servers.
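A minimal sketch of such a ring in Python could look like the following (the server names and the use of MD5 as the hash function are illustrative assumptions):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Map a string to a point on the circular hash space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        # Sorted list of (position, server) pairs on the circle.
        self.ring = sorted((ring_hash(s), s) for s in servers)

    def add(self, server):
        bisect.insort(self.ring, (ring_hash(server), server))

    def remove(self, server):
        self.ring.remove((ring_hash(server), server))

    def lookup(self, key):
        # The key goes to the first server clockwise from its position,
        # wrapping around at the end of the circle.
        i = bisect.bisect_left(self.ring, (ring_hash(key),))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.lookup("user-42"))
ring.add("cache-d")  # only the keys between cache-d's predecessor and cache-d move
print(ring.lookup("user-42"))
```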
Virtual nodes
The balancing problem can be solved by introducing replicas, or virtual nodes, for each server across the circle. Multiple hash functions are used to map each server to several positions on the circular space.
The virtual nodes divide the circular space into smaller ranges, and each physical server is assigned several of these ranges. Since virtual nodes are randomly distributed across the circle, they are generally non-contiguous, so neighboring virtual nodes rarely belong to the same physical server.
As the number of virtual nodes on the circle increases, the key distribution becomes more and more uniform. Typically, the number of virtual nodes per physical server is large (often in the hundreds).
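As an illustration, the ring sketched above can be extended with virtual nodes by hashing each server name several times with a replica index appended (the server names and the number of virtual nodes are made-up values); with a couple of hundred virtual nodes per server, the key counts come out roughly even:

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(servers, vnodes_per_server=200):
    # Each physical server is placed on the circle once per virtual node.
    ring = []
    for server in servers:
        for i in range(vnodes_per_server):
            ring.append((ring_hash(f"{server}#vn{i}"), server))
    ring.sort()
    return ring

def lookup(ring, key):
    i = bisect.bisect_left(ring, (ring_hash(key),))
    return ring[i % len(ring)][1]

ring = build_ring(["db-1", "db-2", "db-3"])
counts = Counter(lookup(ring, f"key-{i}") for i in range(10_000))
print(counts)  # roughly one third of the keys on each physical server
```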
Additional advantages of introducing virtual nodes are:
Virtual nodes divide the circular space into smaller subranges. This speeds up the rebalancing process after adding or removing nodes because the keys to be reassigned involve multiple physical servers instead of one.
Virtual nodes make it easier to keep replicas of the data on distinct physical servers for fault tolerance.
Virtual nodes make it easier to maintain a cluster of heterogeneous machines. Indeed, it is possible to assign more virtual nodes to the more powerful servers and fewer to the less powerful ones.
Coding challenge
The coding challenge of last week was H-Index.
The problem gives as input an array of N integers representing the number of citations each academic paper received. The goal is to find the highest h in the range [0, N] such that there are h papers with citations >= h and N-h papers with citations <= h.
The brute force solution is not difficult to find. The idea is to iterate an index j backward from N to 0 and check whether there are at least j papers with citations >= j. The h-index is the first j for which the condition holds. The time complexity is O(N^2).
Finding a better solution is more elaborate. A helpful observation is that once the citations are sorted in descending order, the h-index is the index j that divides the array into two parts:
a first part where all citations are >= j
a second part where all citations are <= j
So if S is the array of citations sorted in descending order, the h-index is the first j for which S[j] is less than or equal to j (if no such j exists, the h-index is N).
For example, if the citations are [3,0,6,1,5], then the index j = 3 divides S = [6,5,3,1,0] into two parts: [6,5,3], where all elements are >= 3, and [1,0], where all elements are <= 3.
This solution has a time complexity of O(N log N) because of the sorting step.
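A possible Python implementation of this sorting-based solution (the function name is just illustrative):

```python
def h_index_sorted(citations):
    # Sort in descending order, then return the first index j with S[j] <= j.
    s = sorted(citations, reverse=True)
    for j, c in enumerate(s):
        if c <= j:
            return j
    return len(s)  # every paper has at least N citations

print(h_index_sorted([3, 0, 6, 1, 5]))  # prints 3
```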
A better solution can be obtained using a non-comparison-based sorting algorithm like Counting Sort. The critical observations are:
the possible indexes are in the range [0, N]
any citation larger than N can be counted as N without affecting the h-index
Using counting sort to implement the sorting step reduces the time complexity of the previous solution to O(N).
An intelligent tweak of one of the counting sort's steps makes sorting the array unnecessary. The idea is to compute the cumulative sum of the citations' frequencies in reverse order.
If R is the array containing the cumulative sums in reverse order, then R[j] represents the number of papers having at least j citations. So, iterating j from right to left, the h-index is the first index for which R[j] is greater than or equal to j.
For example, if the citations are [3,0,6,1,5], then the frequencies are [1,1,0,1,0,2] and R = [5,4,3,3,2,2]. The first index for which R[j] >= j is 3.
Here is a Python sketch of this last solution.
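```python
def h_index(citations):
    # Counting-sort frequencies plus a reverse cumulative sum, O(N) time.
    n = len(citations)
    # count[c] = number of papers with exactly c citations (capped at n).
    count = [0] * (n + 1)
    for c in citations:
        count[min(c, n)] += 1
    # Accumulate from the right: papers = number of papers with at least j citations.
    papers = 0
    for j in range(n, -1, -1):
        papers += count[j]
        if papers >= j:
            return j
    return 0

print(h_index([3, 0, 6, 1, 5]))  # prints 3
```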
For the next week, I challenge you to solve the problem Longest Substring Without Repeating Characters.
Technical threads of the week
Radix sort.
Row-oriented vs column-oriented databases.
An interesting tweet of this week

This tweet is from this summer, but I would like to bring it to your attention now because this is the time when many of us have end-of-year reviews. I loved all the tips Janahan gave for getting a salary raise, especially the one about keeping a brag doc. Our memory is limited, and a brag doc has helped me a lot during my career.