Rendezvous Hashing
Hi Friends,
Welcome to the 64th issue of the Polymathic Engineer newsletter.
This week, we talk about an interesting hashing strategy that can be used in distributed systems: rendezvous hashing.
Most people are familiar with consistent hashing, which we discussed in the 5th issue of our newsletter. Rendezvous hashing is a less-known hashing technique based on a different approach and is more suitable in certain situations.
The outline will be as follows:
Use cases of hashing in distributed systems
Rendezvous hashing
Comparison between rendezvous and consistent hashing
Post that made me think
Hashing in distributed systems
There are two main scenarios for using hashing in distributed systems.
The first use case is when you need to provide elastic scaling for cache servers. Let's suppose you have a system where clients talk with multiple servers through load balancers.
If the clients' requests are computationally expensive, it is reasonable to use a fast in-memory cache on each server. If the response to a client request is in the cache, the server can immediately return the response.
To effectively use caching, load balancers guarantee that all the same kind of requests will be routed to the same server. They hash the incoming requests and use that hash to reroute the requests according to the hash value.
A second use case is when you need to scale a set of storage nodes. Let's suppose you have a web application and you want to shard your data across multiple database servers.
How do you know which server a particular piece of data will be stored on? Hashing is usually helpful. The data key is hashed, and the hash is used to find the database server where the data is stored.
Rendezvous Hashing
The most straightforward hashing strategy is to
number the servers from 0 to N-1
calculate the hash value of the key modulo N to find a server number
assign the key to the server identified by that number.
This schema works because it always assigns the key to the same server but has a big problem. The mapping between keys and servers gets broken each time a server is added or removed.
Rendezvous hashing is an approach that solves this problem. The basic idea is to calculate a ranking of all the possible servers for each key using some logic. Each key is assigned to the server with the highest ranking for that key.
The advantage of this strategy is clear. Only the keys for which the S has the highest ranking are affected when adding or removing a server S.
When removing S, the keys are assigned to the server with the second highest ranking in their list. Only the keys for which S has the highest ranking are assigned to S when adding S.
Rendezvous hashing is relatively straightforward to implement. Four steps are required:
hash all the possible key-server combinations with a hash function to get a unique list of servers for each key
sort the servers based on the hash values
assign each key to the server with the highest hash value
maintain the association between each key and the server with the highest hash value after adding and removing servers
Some simple ways to hash keys and servers together include concatenating the key with the server or using the server ID as a hash seed.
Comparison with Consistent Hashing
There are three main aspects to consider when comparing rendezvous and consistent hashing.
Query time. Consistent hashing typically uses binary search to map a key to the closest server. So, each query takes O(log(NV)) time, where N is the number of servers and V is the number of virtual nodes for each server. Rendezvous hashing typically takes O(N) because you must calculate the hash for each key-server combination.
Memory requirements. Consistent hashing requires some fixed memory to store the hash values for server and virtual nodes and the mapping between servers and virtual nodes. Rendezvous hashing doesn't require storing any additional data.
Complexity. Rendezvous hashing is more straightforward to explain, understand, and implement than consistent hashing. It provides an even distribution of the keys when adding and removing, assuming a good choice of the hash function. Consistent hashing alone (without virtual nodes) fails to provide an even distribution of the keys, especially for small clusters.
In summary, consistent hashing trades load balancing for scalability and lookup speed, while rendezvous hashing provides an alternative tradeoff that emphasizes equal load balancing.
Consistent hashing is more extensively used, but rendezvous hashing can be a good algorithm for load-balancing medium-size distributed systems, where an O(N) lookup cost is not prohibitive.
Posts that made me think
If your assumptions are wrong, it doesn't matter how sound your logic is. If you start with the wrong premise, you can never arrive at the proper conclusion.
What most people do not understand is that TDD is not a testing strategy but a design technique. The technique just uses tests as part of the process. I don't use it often, but I do when I feel it is helpful. Link
That's precisely my perspective. If I had my office only 10-15 minutes away from home, I wouldn't mind being there every day. I do not dislike the office, but the 2 hours I lose commuting back and forth.