

Synchronizing machines
Why synchronizing the clocks of distributed machines is important, and how to synchronize them with the NTP and PTP protocols.
Hi Friends,
Welcome to the 36th issue of the Polymathic Engineer newsletter. I will start this issue with a more personal note.
The last two weeks have been tough for me since my daddy passed away and is no longer with me. The pain of losing someone you love is tough. Few things compare to the waves of deep and complicated emotions it evokes. I'd never felt such profound sadness, emptiness, numbness, or regret. Probably only time will ease this sorrow.
My father never used a computer or owned a mobile phone in his entire life. But he has always been passionate about clocks, and in his last days, he was always trying to synchronize an old clock with the current time.
In this issue, I want to focus on clock synchronization in distributed systems: why it is critical, and how it can be achieved.
I dedicate this edition of the newsletter to my father.
The importance of synchronizing clocks
Maintaining synchronized system clocks is crucial for running a distributed system. Even a few seconds of offset between servers can cause many different issues:
Consistency. Data stored across multiple nodes needs to be propagated to all nodes, and a conflicting timestamp can cause confusion about which update is the most recent.
Observability. Logs, metrics, and traces are essential for understanding the system, but they are helpful only if machines have synchronized clocks.
Security. Network security is crucial, and many cryptographic protocols rely on synchronized clocks for correctness (e.g., the Kerberos authentication protocol).
Ordering. Understanding the order in which events occur is important and requires synchronized clocks. For example, processing transactions in the wrong order can lead to incorrect account balances and unhappy customers.
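To make the consistency and ordering problems concrete, here is a minimal Python sketch. It assumes a "last write wins" conflict resolution rule, which is only one possible strategy and not something specific to any system mentioned in this issue. The `Write` class, the node names, and the values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Write:
    node: str
    value: str
    true_time: float     # real time of the write in seconds (unknown to the system)
    clock_offset: float  # how far the node's clock is ahead (+) or behind (-)

    @property
    def timestamp(self) -> float:
        # The timestamp the node records: real time plus its clock error.
        return self.true_time + self.clock_offset


def last_write_wins(writes):
    """Pick the write with the largest recorded timestamp."""
    return max(writes, key=lambda w: w.timestamp)


# Hypothetical scenario: node A's clock is 3 seconds fast, node B's is correct.
first = Write(node="A", value="balance=100", true_time=10.0, clock_offset=3.0)
second = Write(node="B", value="balance=80", true_time=11.0, clock_offset=0.0)

winner = last_write_wins([first, second])
print(winner.value)  # the older write from node A wins because of the skew
```

The write from node B happened later in real time, but node A's fast clock gives its older write a larger timestamp, so the newer update is silently discarded.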
The NTP protocol
Computers drift out of sync because they usually rely on quartz clocks. These clocks are cheap but not very accurate: they can drift by a couple of seconds per day.
To compensate for the drift, a computer needs to synchronize with an atomic clock. Atomic clocks are very expensive, but their error is only about 1 second over a span of 100 million years.
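A quick back-of-the-envelope calculation shows how far apart the two kinds of clocks are. The sketch below is only illustrative: the 20 ppm (parts per million) figure for quartz is my assumption for a typical oscillator, while the atomic clock figure is derived from the "1 second in 100 million years" number above.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def drift_per_day(ppm: float) -> float:
    """Seconds of error accumulated in one day at the given drift rate."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def days_until_error(ppm: float, max_error_seconds: float) -> float:
    """Days before an unsynchronized clock exceeds the given error."""
    return max_error_seconds / drift_per_day(ppm)


QUARTZ_PPM = 20.0  # assumption: a typical quartz oscillator
# 1 second of error over 100 million years, expressed in ppm.
ATOMIC_PPM = 1 / (100e6 * SECONDS_PER_YEAR) * 1e6

print(f"quartz drift: {drift_per_day(QUARTZ_PPM):.3f} s/day")
print(f"atomic drift: {drift_per_day(ATOMIC_PPM):.2e} s/day")
print(f"quartz is 1 s off after {days_until_error(QUARTZ_PPM, 1.0):.2f} days")
```

Under this assumption, a quartz clock left alone is already a full second off in less than a day, which is why it needs to be corrected regularly.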
Network Time Protocol (NTP) is the protocol commonly used to synchronize a computer with an atomic clock. The computers that need to be synchronized are the NTP clients, and the computers keeping track of the time are the NTP servers.
The following steps describe the interaction between an NTP client and an NTP server at a high level:
The client sends an NTP request packet to the server. The packet contains the originate timestamp, which is the client's time when the request is sent.
The server marks the time when the request packet is received.
When the server sends the response packet back to the client, it marks the time again (the "transmit timestamp").
The client marks the time when it gets the answer packet from the server.
All the timestamps marked in this process allow the client to figure out the difference between its own time and the server's time, taking the round-trip delay into account. By repeating the exchange multiple times, the client learns how much to adjust its clock and keeps itself synchronized.
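The arithmetic behind the four timestamps can be sketched in a few lines of Python. This uses the standard NTP offset and delay formulas; the function name, the variable names `t0` to `t3`, and the sample values are mine, and the calculation assumes the network delay is the same in both directions.

```python
def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float):
    """
    t0: client sends the request      (client clock)
    t1: server receives the request   (server clock)
    t2: server sends the response     (server clock)
    t3: client receives the response  (client clock)
    """
    # Time spent on the network: total round trip minus server processing time.
    delay = (t3 - t0) - (t2 - t1)
    # How far the client clock is behind (+) or ahead (-) of the server clock,
    # assuming the request and the response take the same time to travel.
    offset = ((t1 - t0) + (t2 - t3)) / 2
    return offset, delay


# Hypothetical exchange: the client clock is 0.5 s behind the server,
# each network leg takes 20 ms, and the server needs 1 ms to reply.
offset, delay = ntp_offset_and_delay(t0=99.500, t1=100.020, t2=100.021, t3=99.541)
print(f"offset: {offset:.3f} s, round-trip delay: {delay:.3f} s")
```

If the two network legs are not symmetric, the offset is wrong by up to half of the difference, which is one of the main limits on NTP accuracy.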
The NTP Stratum model
Unfortunately, there is a limit to the number of computers that a single NTP server can synchronize. To make it possible to synchronize millions of computers, NTP organizes servers in a hierarchy according to a stratum model.
NTP servers are divided into strata, and each stratum indicates how many hops a server is away from a reference clock. Stratum 0 is an atomic clock or GPS receiver, stratum 1 is a server synced directly with a stratum 0 device, stratum 2 is a server synced with a stratum 1 server, and so on.
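As a toy illustration of the stratum model, the sketch below treats the hierarchy as a chain of upstream time sources and counts the hops to the reference clock. The `upstream` dictionary and all the host names are hypothetical, and real NTP servers exchange their stratum in protocol packets rather than walking a table like this.

```python
def stratum(node: str, upstream: dict) -> int:
    """Number of hops from `node` up to a reference clock (stratum 0)."""
    hops = 0
    while upstream[node] is not None:
        node = upstream[node]
        hops += 1
    return hops


# Hypothetical hierarchy: each entry points to the source it syncs from.
upstream = {
    "atomic-clock": None,            # reference clock, stratum 0
    "ntp-a": "atomic-clock",         # synced directly with the reference clock
    "ntp-b": "ntp-a",
    "laptop": "ntp-b",
}

for name in upstream:
    print(name, "-> stratum", stratum(name, upstream))
```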
To ensure synchronization, computers can query multiple servers, discard outliers, and average the rest. They can also query the same server multiple times to reduce random error due to network latency variations.
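Here is a simplified Python sketch of both ideas. It is not the actual NTP selection algorithm, which is considerably more elaborate: the median-based outlier filter, the `tolerance` parameter, and the sample values are my assumptions, chosen only to show the principle.

```python
from statistics import mean, median


def combine_offsets(offsets, tolerance=0.05):
    """Discard offsets far from the median, then average the rest."""
    mid = median(offsets)
    kept = [o for o in offsets if abs(o - mid) <= tolerance]
    return mean(kept)


def best_sample(samples):
    """
    Among repeated (offset, delay) measurements from one server, pick the one
    with the smallest round-trip delay, since it was least disturbed by the network.
    """
    return min(samples, key=lambda s: s[1])


# Hypothetical offsets (in seconds) reported by five servers; one is badly off.
offsets = [0.101, 0.098, 0.103, 2.500, 0.099]
print(f"combined offset: {combine_offsets(offsets):.5f} s")

# Hypothetical repeated measurements from a single server: (offset, delay).
samples = [(0.110, 0.080), (0.101, 0.021), (0.095, 0.045)]
print("best sample:", best_sample(samples))
```

Using the median as the reference point keeps a single wildly wrong server from dragging the result, which a plain average would not do.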
An NTP alternative: PTP
NTP is still widely used, but it is one of the oldest protocols. This is why more advanced protocols have been introduced to sync clocks more precisely.
For example, the Precision Time Protocol (PTP) uses hardware timestamping and transparent clocks to measure the network delay more accurately and adjust for it. PTP places more load on network hardware, but gives quite a few benefits:
High Accuracy: PTP can be accurate to within nanoseconds, while NTP can only be accurate to within milliseconds.
Scalability: NTP systems need to check in often to make sure everything is in sync, which can slow down the network as the system grows. On the other hand, PTP lets systems get their time from a single source, which makes them easier to scale.
Network Delays: PTP measures and compensates for network delays much more precisely, so errors caused by those delays are far less likely.
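To give an idea of how PTP measures and removes the network delay, here is a minimal Python sketch of the delay request-response calculation. The issue does not go into the PTP message exchange, so treat this as background under stated assumptions: the function and variable names are mine, the timestamps are made-up values in nanoseconds, and the path is assumed to be symmetric.

```python
def ptp_offset_and_delay(t1: int, t2: int, t3: int, t4: int):
    """
    t1: time server sends Sync         (server clock)
    t2: client receives Sync           (client clock)
    t3: client sends Delay_Req         (client clock)
    t4: time server receives Delay_Req (server clock)
    """
    # Average one-way delay, assuming both directions take the same time.
    mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2
    # How far the client clock is ahead (+) or behind (-) of the server clock.
    offset = ((t2 - t1) - (t4 - t3)) / 2
    return offset, mean_path_delay


# Hypothetical exchange in nanoseconds: the client clock is 150 ns ahead
# and each direction of the path takes 500 ns.
offset, delay = ptp_offset_and_delay(
    t1=1_000_000, t2=1_000_650, t3=1_001_000, t4=1_001_350
)
print(f"offset: {offset:.0f} ns, mean path delay: {delay:.0f} ns")
```

The formula is close to the NTP one. The accuracy gain comes from where the timestamps are taken: in the network hardware rather than in software.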
PTP deployment at Meta
Meta is one of the companies that successfully used and deployed PTP in its infrastructure. The system consists of three main components: PTP Rack, Network, and Client.
The PTP Rack hosts the hardware and software for serving time to clients. The hardware consists of two pieces: a GNSS antenna that communicates with the Global Navigation Satellite System, and a Time Appliance that consists of a GNSS receiver and a miniaturized atomic clock, which keeps accurate time even if GNSS connectivity is lost.
The Network transmits PTP messages using unicast transmission, simplifying network design and improving scalability.
A PTP client is needed to communicate with the PTP network. Meta used an open source client like ptp4l.
Interesting Tweets
The more you move up, the more your responsibilities and the scope of your work increase. But many engineers do not understand that to move up, they must show they can already meet the additional expectations.
Writing code, even beautiful code, is meaningless if it doesn't address business problems and add value. This is why understanding your organization's business is a prerequisite to level up as a software engineer.
After a while, I normalized the process of being rejected in interviews. It doesn't necessarily mean you are bad. There is a lot of competition, and it can simply be that another candidate performed slightly better than you or was a better fit for the role.
Thinking out loud and asking clarifying questions is crucial in tech interviews. I've seen many candidates tackle interviews like code monkeys and fail. Interviewers want to understand your reasoning, and if you don't communicate, they can't even help when you're stuck.