The fallacies of distributed systems
Eight distributed systems fallacies that are often overlooked during system design.
Hi Friends,
Welcome to the 97th issue of the Polymathic Engineer. This week, we talk about the fallacies of distributed systems.
More than 20 years ago, L. Peter Deutsch and others at Sun Microsystems compiled a list of false assumptions that developers new to distributed applications tend to make.
Sooner or later, these assumptions prove wrong, and the resulting bugs are hard to track down. We are going to look at these eight fallacies in more depth here.
Project-based learning is the best way to build technical skills. CodeCrafters is an excellent platform to practice interesting projects such as building your own version of Redis, Kafka, an HTTP server, SQLite, and even Git from scratch.
Sign up and become a better software engineer.
Fallacy #1: The Network is Reliable
When building a distributed system, it's dangerous to assume that the network that connects your components works perfectly.
Indeed, networks are inherently unreliable: packets can be dropped, connections can be interrupted, and data can become corrupted during transmission. And that's just the beginning.
To build a robust system, we need to accept and plan for these potential failures. An effective strategy is to put in place retransmission mechanisms. The store and forward pattern is a common way to do this.
Instead of directly sending data to a downstream server, we can store it locally or in an intermediate location. This allows for retries and provides a way to recover in catastrophic scenarios where a simple retry loop wouldn't be enough.
Message brokers like RabbitMQ and ActiveMQ are ideal to support this pattern, as are various cloud-native solutions offered by major providers.
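Here is a minimal sketch of the idea in Python. The `send_downstream` function is a hypothetical placeholder for the real network call, and a plain in-memory list stands in for durable storage; in practice, the buffer would be a broker queue or a persistent local store.

```python
import time

def send_downstream(message):
    """Placeholder for the real network call; assume it raises ConnectionError on failure."""
    raise ConnectionError("downstream unavailable")

def enqueue(message, buffer):
    # Store first, so the message survives even if the send below never succeeds.
    buffer.append(message)

def flush(buffer, max_retries=5):
    """Try to deliver buffered messages in order, with exponential backoff between retries."""
    while buffer:
        for attempt in range(max_retries):
            try:
                send_downstream(buffer[0])
                buffer.pop(0)              # Drop the message only after a confirmed send.
                break
            except ConnectionError:
                time.sleep(2 ** attempt)   # Back off: 1s, 2s, 4s, ...
        else:
            return False                   # Give up for now; messages stay buffered for a later pass.
    return True
```

The key property is that a message is removed from the buffer only after the downstream system has accepted it, so a crash or a network failure never silently loses data.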
Fallacy #2: Latency is Zero
Latency is the time it takes for data to move from one place to another, and it is a fact you can't ignore in distributed systems.
Latency is a tax that every request has to pay. Whether a message is large or small, the latency component stays roughly constant, since it is primarily determined by the speed of light and the distance between the communicating systems.
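To see why distance dominates, here is a back-of-the-envelope calculation. The figures are approximate: light travels at roughly 200,000 km/s in optical fiber, and New York to London is about 5,600 km one way.

```python
# Rough lower bound on round-trip time imposed by physics alone.
distance_km = 5_600            # approximate New York -> London distance
speed_in_fiber_km_s = 200_000  # light in fiber travels at roughly 2/3 of c

one_way_s = distance_km / speed_in_fiber_km_s
round_trip_ms = 2 * one_way_s * 1000
print(f"Minimum round trip: {round_trip_ms:.0f} ms")  # ~56 ms, before any processing at all
```

No amount of faster hardware removes those ~56 ms; only moving the endpoints closer together does.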
The most obvious way to mitigate latency issues is to move data closer to clients. In a cloud environment, we can choose regions and availability zones based on where our clients are located.
In addition, we can use caching to reduce network calls and replicate data closer to where it is needed using Content Delivery Networks (CDNs).
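A minimal cache-aside sketch: the `fetch_from_origin` function is a hypothetical stand-in for the slow remote call, and an in-process dictionary with a TTL plays the role of a real cache such as Redis.

```python
import time

_cache = {}  # key -> (value, expiry timestamp)

def fetch_from_origin(key):
    """Placeholder for the slow, remote call we want to avoid repeating."""
    return f"value-for-{key}"

def get(key, ttl_seconds=60):
    entry = _cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                                   # Cache hit: no network round trip.
    value = fetch_from_origin(key)                        # Cache miss: pay the latency once...
    _cache[key] = (value, time.time() + ttl_seconds)      # ...then serve locally until the TTL expires.
    return value
```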
Fallacy #3: Bandwidth is Infinite
Bandwidth is the amount of data that can be transferred in a given time, and it is not unlimited. While bandwidth has improved over time, the amount of data we send has also increased.
Modern applications like VoIP or video streaming consume a lot of bandwidth. Even simple things like richer UIs and verbose data formats (like XML) increase the amount of data transferred.
There's a delicate balance between minimizing latency and conserving bandwidth: to reduce latency, we want to transfer more data in fewer round trips, but to save bandwidth, we need to send less data overall.
Efficient data formats and compression both help address bandwidth limitations.
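As a quick illustration, this sketch compares the size of a verbose JSON payload before and after gzip compression; the sample data is made up, and the exact ratio depends on how repetitive the payload is.

```python
import gzip
import json

# A deliberately repetitive payload, similar to what a chatty API might return.
payload = [{"user_id": i, "status": "active", "description": "lorem ipsum " * 10}
           for i in range(1000)]

raw = json.dumps(payload).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes "
      f"({len(compressed) / len(raw):.0%} of original)")
```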
Fallacy #4: The Network is Secure
Assuming the network we are using is secure is a critical mistake. With the rise of bug bounty programs and frequent news of major exploits, security should be a top priority.
Malicious users constantly try to intercept and decode network communications. This makes data encryption and security risk mitigation essential.
Keeping security in mind from the start of system design always pays off in the long term. It's also important to take the time to analyze possible security threats systematically.
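A small sketch of treating the network as hostile: open a TLS connection with certificate verification instead of plain TCP, using Python's standard ssl module. The hostname is just an example endpoint.

```python
import socket
import ssl

hostname = "example.com"                # any HTTPS endpoint works here
context = ssl.create_default_context()  # verifies certificates and hostnames by default

with socket.create_connection((hostname, 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=hostname) as tls_sock:
        print("negotiated:", tls_sock.version())                    # e.g. TLSv1.3
        print("peer certificate subject:", tls_sock.getpeercert()["subject"])
```

The point of the default context is that it fails loudly on an invalid or mismatched certificate rather than silently talking to whoever answered.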
Fallacy #5: Topology Doesn't Change
The network structure isn't static. It changes frequently, sometimes due to something breaking, but most of the time because we add or remove components on purpose.
With the rise of cloud computing and containerization, network topology changes are even more common. Also, elastic scaling, where we add or remove servers based on workload, requires flexibility.
ZooKeeper and Consul are great tools for service discovery; they let applications adapt at runtime to changes in how our systems are deployed and wired together.
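As an illustration, here is a sketch of registering a service and looking up its healthy instances through Consul's HTTP API, assuming a local Consul agent listening on its default port 8500; the service name, address, and port are made-up examples.

```python
import json
import urllib.request

CONSUL = "http://localhost:8500"  # assumes a local Consul agent with default settings

def register(name, service_id, address, port):
    body = json.dumps({"Name": name, "ID": service_id,
                       "Address": address, "Port": port}).encode()
    req = urllib.request.Request(f"{CONSUL}/v1/agent/service/register",
                                 data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def healthy_instances(name):
    # Only instances whose health checks pass are returned.
    with urllib.request.urlopen(f"{CONSUL}/v1/health/service/{name}?passing=true") as resp:
        entries = json.load(resp)
    return [(e["Service"]["Address"], e["Service"]["Port"]) for e in entries]

register("payments", "payments-1", "10.0.0.5", 8080)
print(healthy_instances("payments"))  # callers resolve addresses at runtime, not from static config
```

Because callers ask the registry at request time, adding or removing instances doesn't require redeploying or reconfiguring every client.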
Fallacy #6: There is One Administrator
As a system grows, it often relies on external systems outside our control. This means we can't manage everything ourselves.
It's crucial to consider all the dependencies, from our code to the servers we run on. As the number of systems and configurations increases, it becomes hard to manage and track everything.
A first step is to implement robust monitoring and observability tools, which are critical for diagnosing issues quickly when they arise.
Also, using Infrastructure as Code (IaC) to codify the system variations and focus on appropriate decoupling helps ensure overall system resiliency and uptime.
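On the monitoring side, here is a minimal sketch of exposing application metrics with the Prometheus Python client (a third-party package); the metric names, labels, and port are arbitrary examples.

```python
import random
import time

# Third-party dependency: pip install prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_processed_total", "Orders processed, by outcome", ["outcome"])
LATENCY = Histogram("order_processing_seconds", "Time spent processing an order")

def process_order():
    with LATENCY.time():                       # record how long the work took
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        outcome = "ok" if random.random() > 0.05 else "error"
    REQUESTS.labels(outcome=outcome).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        process_order()
```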
Fallacy #7: Transport Cost is Zero
Sending data between systems looks like a negligible business cost only as long as volumes are small. However, as systems grow, data transport costs become more significant.
At a certain point, it may be worth optimizing. For example, text formats like JSON are heavier than binary, transfer-optimized formats like Protocol Buffers, which gRPC uses.
In cloud environments like AWS, data transfer costs between regions or availability zones are real and can add up quickly.
Being aware of these costs is essential, but optimization has its tradeoffs: if we optimize too early, it might cause more trouble than it's worth.
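A rough back-of-the-envelope calculation makes the point. All the numbers below are hypothetical placeholders; actual per-GB rates vary by provider, region, and direction of traffic.

```python
# Hypothetical numbers: adjust to your own traffic and your provider's pricing.
requests_per_second = 2_000
avg_payload_kb = 50
price_per_gb_usd = 0.02  # placeholder cross-AZ / egress rate

gb_per_month = requests_per_second * avg_payload_kb * 3600 * 24 * 30 / (1024 ** 2)
print(f"~{gb_per_month:,.0f} GB/month -> ~${gb_per_month * price_per_gb_usd:,.0f}/month")
```

Even with modest payloads, the monthly bill is far from zero, which is why halving payload sizes can translate directly into savings.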
Fallacy #8: The Network is Homogeneous
While we'd like everything to be clean and tidy, the real world is far from it. For example, we often need to transform data from one format into another.
Designing for interoperability is important: it keeps our systems working even if a new framework comes out or we need to use them in places they weren't designed for.
Interoperability does have some limits, though. We need to find the right balance.
We can save time and trouble in the long run if we keep in mind that not all systems are the same and avoid coupling our solution to one specific technology or data format.
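As a small example of bridging formats, this sketch converts an XML payload from one system into the JSON shape another system expects, using only the Python standard library; the field names are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Payload produced by a legacy system that speaks XML.
xml_payload = """
<order id="42">
    <customer>Ada</customer>
    <total currency="EUR">19.90</total>
</order>
"""

root = ET.fromstring(xml_payload)
# Translate into the JSON shape a newer service expects.
order = {
    "id": int(root.get("id")),
    "customer": root.findtext("customer"),
    "total": float(root.findtext("total")),
    "currency": root.find("total").get("currency"),
}
print(json.dumps(order))  # {"id": 42, "customer": "Ada", "total": 19.9, "currency": "EUR"}
```

Keeping this kind of translation at the edges of a system means the core logic never has to care which wire format a particular peer happens to use.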
Interesting Read
Some cool articles I read the past week: