

Discover more from The Polymathic Engineer
Hi Friends,
Welcome to the 46th edition of the Polymathic Engineer newsletter.
This week, we will entirely focus on distributed systems. First, we will talk about which guarantees can be associated with delivering messages between the different nodes in a distributed system.
Second, we will discuss a popular distributed system pattern called Hinted Handoff.
Neo Kim, the author of the well-known System Design Newsletter, will present this second topic. He is the first guest author I am hosting here on the Polymathic Engineer, and I hope you’ll enjoy the reading.
The outline will be as follows:
Message delivery semantics in distributed systems
The Hinted Handoff pattern and its applications
Interesting Tweets
Before starting with the technical topic, let me thank all of you who became paid subscribers of our newsletter.
I appreciate your support and have already started working on the first paid-only issue dedicated to a popular NoSQL data store: Apache Cassandra. It will include a detailed description of its architecture and a hands-on tutorial.
Message Delivery Semantic
Distributed systems works because nodes exchange messages over a network. But which guarantees can be given about the delivery of those messages?
There are 3 possible messages delivery semantics between 2 nodes:
at-least-once: messages can’t be lost but the same message might be delivered multiple times. This semantic is not ideal from a user perspective, but it usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the receiver side. For example, messages with a unique key can be easily rejected when duplicate data is received.
at-most-once: a message cant be delivered more than once. Messages may be lost but are not redelivered. This semantic is also not ideal, but it is suitable for use cases where a small amount of data loss is acceptable (i.e. monitoring or logging).
exactly-once: a message is guaranteed to be received exactly once. It is the best semantic for the users, but as we will see it can’t be actually implemented. It can be only emulated at the price of a high cost for the system’s performance and complexity.
Why exactly-once delivery can’t be implemented?.
Let's suppose that a node A sends a message to a node B. To be sure that the message has been received, node A needs to receive an acknowledge from B.
There are 2 different possibilities according to when the message is acknowledged:
B acknowledges the message immediately before processing it. A receives the ack and thinks that everything is fine. But if B crashes during the processing, the message is lost forever. So the message can be either received or not by B (at-most-once delivery).
Actually things are more complicated if there is a message queue in the middle and multiple workers. In this case the broker shall ensure that a message is not delivered to any other worker once it’s been acknowledged. Kafka handle this coordination using ZooKeeper.
B acknowledges the message after it has been processed. If B crashes before acknowledging the message or the ack isn’t delivered for a network partition, A will redeliver it. So the message can be either received once or multiple times by B (at-least-once).
Since these are the only 2 possibilities, it's not possible to guarantee that a message is received exactly once. The only way to get an exactly-once semantic in practice is by faking it in 2 possible ways.
The 1st is implementing a deduplicating message logic on B. The 2nd one is using idempotent messages that can be applied more than once without side effects.
RabbitMQ attempts to guarantees an exactly once delivery using the above 2 techniques.
Hinted Handoff
Imagine you want to go on a coffee break at work, but you expect to receive some messages. If you go either the senders wait until you return to your desk or you miss the messages.
But there is a third possibility: you inform a coworker to collect any message on your behalf, and they give you the messages when you return to your desk.
In distributed systems, this technique of storing messages when the receiver is unavailable is called hinted handoff, and it is usually used to do repairs in the write path.
The workflow
The server responsible for storing data is called the target node. The server that stores data temporarily is called the coordinator node. Here is how it works:
The coordinator node stores hints if the target node is unavailable for writes. Hints are data changes along with metadata of the target node.
The coordinator node then sends the hints to the target node when it’s available again.
The hints get stored on disk in the coordinator node for a certain time for durability.
The coordinator node rejects hints if the target node stays unavailable for an extended time.Â
The hints get removed when the target node is decommissioned or their TTL expires.
A networking protocol like gossip protocol can be used to check the health status of servers.
Pros and cons
The hinted handoff provides eventual consistency and high availability. It protects the system against temporary failures of the server..
The benefits of hinted handoff are:
Improved read performance
High write availability even while operating at reduced capacity
protects the system against temporary failures of the server
The limitations of hinted handoff are:
Risk of data loss if the coordinator node fails before sending the hints to the target node
Extra storage needs and increased system complexity
Some popular use cases of hinted handoff are the handling of temporary node failures Amazon DynamoDB, and the routing of traffic to healthy nodes within CDNs.
Interesting Tweets
During my career, the more I learned about creating high-quality software, the more humble I became. Nobody in a company wants to deal with big ego, doing business is the priority.
Networking with non-technical people is what make you a better engineer. If you don't get better in communicating with them, it will be harder to grow into more senior positions.
Well written tests are the best form of documentation. Even better than any developer written wiki pages.
Message Delivery
hey Fernando, thank you so much for the guest post opportunity in your awesome newsletter.
Super interesting read. Thanks both Fernando and NK!
The article came at a perfect time, I was just going through the Amazon Dynamo paper. It was interesting to learn the durability concern of the hinted handoff and that it can be remediated with periodic replica synchronization to reconcile them
About the message queues for exactly-once delivery, I think idempotent messages solve part of the problem, but may introduce an extra problem of message ordering. x=5 is idempotent as I can execute it multiple times in a row, but when interacting with another message, if the order on the consumer side is x=5, x=3, x=5 (e.g. a retry of the x=5 message), then the outcome is not expected.
I'd love to read more about message ordering in distributed systems. Looking forward for the next articles!