The Challenges of Distributed Systems
What every engineer should know when working with distributed systems.
Hi Friends,
Welcome to the 129th issue of the Polymathic Engineer newsletter.
In this post, we'll explore the fundamental challenges that need to be solved to design, build, and operate effective distributed systems. Understanding these challenges is crucial for any engineer working with modern applications.
The outline is as follows:
What is a distributed system?
Communication
Coordination
Scaling
Resiliency
Maintainability
What is a distributed system?
If we simplify things, a distributed system is a group of nodes that work together to complete a task by exchanging messages over communication links. A "node" here can refer to different things depending on the perspective from which we view it.
From a hardware point of view, a node is a physical machine, like a computer, a mobile phone, or a tablet. From a runtime perspective, it is a software process that talks with other processes via inter-process communication mechanisms, such as HTTP. From an operational point of view, a node is a service: a loosely coupled component that talks to other services through APIs. Each service implements a specific functionality, with business logic at its core and interfaces that connect it to the outside world.
But why do we even need to build distributed systems? There are many strong reasons for this. The main one is that some applications are naturally distributed. The web is a great example: it is a distributed system that you use every day from your devices.
Another important factor is high availability. Some systems must be able to work even if a single node fails. Consider how Dropbox creates copies of your data across various nodes, ensuring that if one fails, you won't lose your files forever.
Then there's the issue of scaling. Sometimes, a single node can't handle all the work, no matter how powerful it is. Tens of thousands of search requests come to Google every second from all over the world. That's too much for a single computer to handle.
Finally, distributed systems are necessary to fulfill performance requirements. Netflix places servers close to its users, so it can stream high-resolution movies to your TV without any problems.
Despite all these reasons, distributed systems are a complex and tricky beast. In the following sections, we'll explore the main challenges that must be taken into account when designing and building effective distributed systems.
Communication
Nodes in a distributed system need to communicate over a network to work together. At first glance, this may seem straightforward, but it raises several complex issues.
For example, when your browser wants to load a website, it first uses the URL to find the server's IP address and then sends an HTTP request. The server takes this request and sends back a response with the page's content.
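The two steps above (name resolution, then an HTTP request) can be sketched with Python's standard library. This is only a minimal illustration, not how a browser is actually implemented: `HelloHandler` and `fetch` are hypothetical names, and a tiny local server stands in for the website so the example runs without Internet access.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real web server, so the sketch runs offline."""

    def do_GET(self):
        body = b"<h1>hello</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def fetch(url, timeout=2.0):
    parts = urlsplit(url)
    # Step 1: resolve the host name in the URL to an IP address.
    # (IPv4 only here, because the local test server listens on 127.0.0.1.)
    infos = socket.getaddrinfo(parts.hostname, parts.port,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    ip = infos[0][4][0]
    # Step 2: connect to that address and send an HTTP request.
    conn = http.client.HTTPConnection(ip, parts.port, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/", headers={"Host": parts.netloc})
        response = conn.getresponse()
        # Step 3: the server answers with a status code and the page content.
        return ip, response.status, response.read()
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
ip, status, body = fetch(f"http://localhost:{server.server_port}/")
server.shutdown()
server.server_close()
print(ip, status, body)
```

Note the explicit `timeout`: even in this toy version, the client has to decide how long it is willing to wait for an answer.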
However, there is a lot more going on behind the scenes. What do these request and response messages look like? What makes sure they are delivered correctly? What happens if the network goes down for a short time or if a faulty network switch flips bits in the messages? How does the server make sure that no one else can read the data that is being sent?
Networking libraries take care of many of these things for you, but you need to know what's going on inside.
First, networks are unreliable. Packets can get lost, duplicated, or delivered out of order. Your system needs ways to detect and recover from these problems, like timeouts, retries, and sequence numbers.
Second, networks introduce a delay between sending a message and getting an answer, known as latency. This delay changes depending on how far apart the devices are, how busy the network is, and the route the packets take. High latency can have a big effect on responsiveness and the user experience.
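To make retries and sequence numbers a bit more concrete, here is a rough sketch of a stop-and-wait exchange over a simulated lossy link. Everything in it is an assumption made for illustration: `LossyChannel`, `Receiver`, and `send_reliably` are hypothetical names, and a "timeout" is modeled simply as an attempt in which no acknowledgment came back.

```python
import random


class LossyChannel:
    """Hypothetical unreliable link: may drop or duplicate each message."""

    def __init__(self, seed, drop_rate=0.3, dup_rate=0.3):
        self.rng = random.Random(seed)
        self.drop_rate = drop_rate
        self.dup_rate = dup_rate

    def deliver(self, message):
        """Return the copies that actually arrive (zero, one, or two)."""
        if self.rng.random() < self.drop_rate:
            return []
        if self.rng.random() < self.dup_rate:
            return [message, message]
        return [message]


class Receiver:
    """Uses sequence numbers to discard duplicates and restore order."""

    def __init__(self):
        self.next_seq = 0
        self.buffer = {}
        self.delivered = []

    def on_message(self, seq, payload):
        if seq >= self.next_seq:
            self.buffer[seq] = payload  # a duplicate just overwrites itself
        # Hand over every message that is now in order.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return seq  # acknowledgment for the sender


def send_reliably(payloads, channel, receiver, max_attempts=50):
    for seq, payload in enumerate(payloads):
        for _ in range(max_attempts):
            acked = False
            for copy in channel.deliver((seq, payload)):
                ack = receiver.on_message(*copy)
                # The acknowledgment crosses the same lossy link.
                if channel.deliver(ack):
                    acked = True
            if acked:
                break  # ack arrived: move on to the next message
            # No ack: treat it as a timeout and retransmit.
        else:
            raise TimeoutError(f"message {seq} was never acknowledged")


payloads = ["a", "b", "c", "d"]
receiver = Receiver()
send_reliably(payloads, LossyChannel(seed=7), receiver)
print(receiver.delivered)
```

The sender cannot tell a lost message from a lost acknowledgment, so it may retransmit something the receiver already has. The sequence number is what lets the receiver ignore that copy.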
Third, bandwidth is limited. Network capacity has grown a lot over the years, but so has the amount of data we need to send. Large payloads can saturate network links, which causes congestion and makes delays even worse.
Fourth, networks aren't secure by default. Attackers can read or even modify messages in transit. To make sure communication is safe, you need to use encryption, authentication, and integrity checks.
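A back-of-the-envelope calculation helps to see how latency and bandwidth combine. The sketch below is a deliberate simplification: it assumes light travels through optical fiber at roughly 200,000 km/s, and it ignores handshakes, congestion, and protocol overhead. The function names and the sample numbers are made up for the example.

```python
FIBER_SPEED_KM_PER_S = 200_000  # approximate speed of light in optical fiber


def min_round_trip_s(distance_km):
    """Lower bound on round-trip time imposed by distance alone."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S


def transfer_time_s(size_bytes, bandwidth_bits_per_s, round_trip_s):
    """Rough time to request and receive a payload: one round trip plus the
    time needed to push the bytes through the link."""
    return round_trip_s + size_bytes * 8 / bandwidth_bits_per_s


rtt = min_round_trip_s(6_000)  # two nodes 6,000 km apart
total = transfer_time_s(10_000_000, 100_000_000, rtt)  # 10 MB over 100 Mbit/s
print(f"round trip >= {rtt * 1000:.0f} ms, transfer ~ {total:.2f} s")
```

With these numbers, distance alone costs about 60 ms per round trip, and the 10 MB payload adds about 0.8 s on top. For small messages latency dominates; for large ones bandwidth does.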
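As a small illustration of an integrity and authentication check, the sketch below attaches an HMAC tag to each message using Python's standard `hmac` and `hashlib` modules. It assumes both sides already share a secret key (the key and the messages are made up), and it does not encrypt anything: in practice, confidentiality usually comes from a protocol such as TLS.

```python
import hashlib
import hmac

SHARED_KEY = b"example-key-known-to-both-nodes"  # hypothetical shared secret


def sign(key, message):
    """Compute an HMAC-SHA256 tag that only holders of the key can produce."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key, message, tag):
    """Check the tag in constant time to avoid leaking timing information."""
    return hmac.compare_digest(sign(key, message), tag)


message = b"transfer 10 EUR to alice"
tag = sign(SHARED_KEY, message)

print(verify(SHARED_KEY, message, tag))                       # untouched message
print(verify(SHARED_KEY, b"transfer 9999 EUR to eve", tag))   # tampered message
```

A receiver that drops every message whose tag does not verify will notice both accidental corruption and deliberate tampering by anyone who does not know the key.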
Lastly, networks can partition. Network problems can make it so that nodes that should be able to talk to each other can't. Your system has to decide what to do with these partitions: should it favor availability or consistency?
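The choice between availability and consistency can be sketched with a toy replica that behaves differently depending on a configured mode. This is a simplified illustration under assumed names (`Replica`, `Unavailable`, the "CP" and "AP" labels), not the behavior of any specific database.

```python
class Unavailable(Exception):
    """Raised when a replica refuses a request to protect consistency."""


class Replica:
    def __init__(self, cluster_size, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.cluster_size = cluster_size
        self.mode = mode
        self.data = {}

    def has_majority(self, reachable_peers):
        # Count this replica plus the peers it can still talk to.
        return reachable_peers + 1 > self.cluster_size // 2

    def write(self, key, value, reachable_peers):
        if self.mode == "CP" and not self.has_majority(reachable_peers):
            # Favor consistency: refuse rather than risk diverging data.
            raise Unavailable("minority side of a partition: write rejected")
        # Favor availability (or majority reached): accept the write locally.
        self.data[key] = value


# A 5-node cluster splits so that this replica can reach only one peer.
cp, ap = Replica(5, "CP"), Replica(5, "AP")
ap.write("seat", "alice", reachable_peers=1)      # accepted, may conflict later
try:
    cp.write("seat", "alice", reachable_peers=1)  # rejected
except Unavailable as err:
    print("CP replica:", err)
print("AP replica:", ap.data)
```

The consistency-first replica stays correct but stops serving writes on the minority side. The availability-first replica keeps working, at the price of having to reconcile conflicting writes once the partition heals.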
Because of these problems, distributed systems are inherently harder to reason about than single-node systems. Even simple tasks become challenging when they require interaction across network boundaries.
Coordination
One of the most challenging aspects of distributed systems is getting different nodes to collaborate effectively. Even if every node works properly on its own, getting them to work together as a single system is hard, especially when things can go wrong.
There is a well-known puzzle called the "two generals" problem that shows how tough this can be.
Imagine two generals, each with their army. They need to agree on a time to attack a city together. They are some distance apart, and the only way for them to talk is to send messengers, but the enemy could catch them, just like messages could get lost in the network.
General 1 could send General 2 a message with a possible time to attack. What if the messenger is caught, though? General 1 wouldn't know if the message got through. You might think that General 2 could send a confirmation message, but they wouldn't know if that message got through, either.
Surprisingly, no matter how many confirmation messages they send back and forth, neither general can ever be 100% certain that they'll attack at the same time. This simple example shows why it's so hard to coordinate in distributed systems.
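One way to see why extra confirmations never help is to compare two runs of the same exchange: one where every message arrives, and one where only the last message is lost. The small simulation below is an illustrative sketch with hypothetical function names. It assumes the generals alternate messages, starting with General 1.

```python
def views(n_messages, lost_index=None):
    """Simulate an alternating exchange started by general 1.

    Returns, for each general, the indexes of the messages that general
    received. If message `lost_index` is captured, the exchange stops there.
    """
    received = {1: [], 2: []}
    for i in range(n_messages):
        sender = 1 if i % 2 == 0 else 2
        receiver = 3 - sender
        if i == lost_index:
            break  # messenger captured
        received[receiver].append(i)
    return received


def last_sender_can_tell(n_messages):
    """Can the sender of the final message tell whether it arrived?"""
    last = n_messages - 1
    last_sender = 1 if last % 2 == 0 else 2
    all_delivered = views(n_messages)
    last_lost = views(n_messages, lost_index=last)
    # The sender can only decide based on what it has received itself.
    return all_delivered[last_sender] != last_lost[last_sender]


for n in (1, 2, 5, 100):
    print(n, last_sender_can_tell(n))
```

For every protocol length, the sender of the final message has received exactly the same messages in both runs, so it cannot know whether the other general got it. Adding one more confirmation only moves the doubt to whoever sends that one.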
The problem shows up in many ways:
If more than one node needs to agree on who the leader is, how do they do it if some nodes fail or can't be reached?
What does the system do if two people try to book the last seat on a flight at the same time? How does it make sure that the same seat isn't booked twice?
If a database is replicated across multiple servers for reliability, how do you ensure they all have the same data when updates happen?
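The seat-booking question boils down to making "check that the seat is free" and "assign it" a single atomic step. The sketch below shows the idea inside one process, with threads standing in for concurrent users and a lock standing in for whatever a real system would use (for instance a database transaction or a conditional write). `SeatMap` and the passenger names are hypothetical.

```python
import threading


class SeatMap:
    def __init__(self, seats):
        self._owner = {seat: None for seat in seats}
        self._lock = threading.Lock()

    def book(self, seat, passenger):
        """Atomically check that the seat is free and assign it."""
        with self._lock:
            if self._owner[seat] is None:
                self._owner[seat] = passenger
                return True
            return False  # someone else got there first

    def owner(self, seat):
        with self._lock:
            return self._owner[seat]


seats = SeatMap(["12A"])
results = []


def try_to_book(passenger):
    results.append((passenger, seats.book("12A", passenger)))


threads = [threading.Thread(target=try_to_book, args=(f"user-{i}",))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [passenger for passenger, ok in results if ok]
print(winners, seats.owner("12A"))
```

Exactly one of the twenty attempts succeeds. Without the lock, two threads could both see the seat as free and both believe they had booked it. Across several machines the same guarantee is much harder to get, because the check and the update have to be coordinated over the network.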
These coordination problems get even more challenging at scale. It's more likely for something to go wrong when there are more nodes. When nodes are far apart, latency builds up, making it hard to coordinate quickly.
To deal with these problems, engineers have come up with a number of algorithms and methods, such as distributed consensus protocols, distributed transactions, and eventually consistent systems. But each comes with its own trade-offs among consistency, availability, and performance.
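To give a flavor of the eventually consistent end of that spectrum, here is a sketch of one simple reconciliation rule, often called last-writer-wins. It is only one of several possible approaches, and it rests on assumptions that are not trivial in practice: each write carries a timestamp that can be compared across nodes, and ties are broken by node name. The `Versioned` type and the `merge` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # assumed to be comparable across nodes
    node_id: str    # tie-breaker when timestamps are equal


def merge(a, b):
    """Keep the write with the highest (timestamp, node_id) pair."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


# Two replicas accept different writes while they cannot reach each other.
on_replica_a = Versioned("blue", timestamp=10, node_id="a")
on_replica_b = Versioned("green", timestamp=12, node_id="b")

# Once they can exchange state again, both apply the same rule.
print(merge(on_replica_a, on_replica_b))
print(merge(on_replica_b, on_replica_a))
```

Because the rule gives the same result regardless of the order in which states are merged, all replicas converge to the same value. The trade-off is visible too: the write "blue" is silently discarded, which is acceptable for some data and unacceptable for other data.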