Hi Friends,
Welcome to the 123rd issue of the Polymathic Engineer newsletter.
This week, we’ll have an overview of Chaos Engineering, an approach to testing that uses intentional failure to make distributed systems more resilient.
The outline will be as follows:
The cost of downtime
What is Chaos Engineering?
The Four Steps of Chaos Experiments
Key Principles for Effective Chaos Engineering
Implementation Approaches
Real-World Examples
The Cost of Downtime
These days, most businesses depend on their online services. Even short outages can cost them a lot of money and hurt their image.
According to Forbes, downtime now costs big businesses as much as $9,000 per minute on average. In some fields, like finance and healthcare, it can reach $5 million per hour.
Take Facebook's 2021 outage as an example. Because of a faulty network configuration change that left the company's DNS servers unreachable, apps like Facebook, Instagram, and WhatsApp were inaccessible for around seven hours. Ads, the company's main source of income, brought in almost no money during that time. On top of that, every company that relies on Facebook for communication and marketing was affected too.
The more complicated systems get, the higher the risks. Big tech companies run large distributed systems made of many microservices that communicate over networks.
Because of this, it is hard to reason about the system as a whole. The old ways of testing, which focus on controlled settings and expected failures, aren't enough anymore. They can't account for all the different ways something could fail.
The rising cost of breakdowns and the increase in complexity pushed engineers to look for new ways to make systems more resilient.
This is where Chaos Engineering comes into play. By injecting controlled failures into our systems on purpose, we can see how they behave, find their weak spots, and make them stronger. We can find and fix weaknesses before they turn into production incidents, instead of waiting for problems to appear out of the blue, usually at the worst possible time.
Big tech companies are starting to use this method more and more. In this article we will talk about how these businesses have come up with complex methods and tools to use Chaos Engineering on a large scale.
What is Chaos Engineering?
When you do Chaos Engineering, you test how well your system works when things go wrong. It's kind of like a fire drill for your infrastructure. Instead of waiting for big problems to happen, you set up small, controlled failures to see how your system handles them.
The idea comes from the scientific method. First, you make a hypothesis about what you think will happen. Then, you test it by making a certain kind of failure happen. At the end, you check whether your system handled it the way you thought it would.
For example, you might wonder: "What happens if one of our database servers crashes?" Instead of guessing or waiting for it to happen naturally, you could deliberately shut down a database server in a controlled way and watch what happens.
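To make the database example more concrete, here is a minimal sketch of such an experiment in Python. Everything in it is an assumption made for illustration: `DatabaseServer`, `Cluster`, and `run_experiment` are hypothetical names, and the "crash" is only a flag on an in-memory object rather than a real server being shut down. What it tries to show is the shape of the experiment: state a hypothesis, inject the failure, measure, and always roll back.

```python
from dataclasses import dataclass


@dataclass
class DatabaseServer:
    """Hypothetical stand-in for a real database node."""
    name: str
    healthy: bool = True

    def query(self, sql: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} answered: {sql}"


@dataclass
class Cluster:
    """Toy client that fails over to the next server when one is unreachable."""
    servers: list

    def query(self, sql: str) -> str:
        for server in self.servers:
            try:
                return server.query(sql)
            except ConnectionError:
                continue  # try the next server
        raise ConnectionError("no database server available")


def run_experiment(cluster: Cluster, victim: DatabaseServer, requests: int = 100) -> dict:
    """Crash one server, send traffic, restore the server, report the error rate."""
    victim.healthy = False  # inject the failure
    errors = 0
    try:
        for i in range(requests):
            try:
                cluster.query(f"SELECT {i}")
            except ConnectionError:
                errors += 1
    finally:
        victim.healthy = True  # always roll back, even if the experiment itself fails
    return {"requests": requests, "errors": errors, "error_rate": errors / requests}


servers = [DatabaseServer("db-1"), DatabaseServer("db-2"), DatabaseServer("db-3")]
cluster = Cluster(servers)

# Hypothesis: losing one server does not cause any failed queries.
result = run_experiment(cluster, victim=servers[0])
hypothesis_holds = result["error_rate"] == 0.0
print(result, "hypothesis holds:", hypothesis_holds)
```

In a real system the injection step would stop an actual process or machine, and the measurement would come from your monitoring rather than from a loop, but the structure of the experiment would likely stay the same.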
This is different from regular testing. Testing normally means checking that your code does what it's meant to do in everyday situations. When you do Chaos Engineering, you test how your whole system behaves when things go wrong or are under stress.
The goal isn't to break things for fun. It's to find weaknesses before they cause real problems. If you find that your system doesn't handle a certain kind of failure well, you can fix it before it happens to your customers.
People sometimes mix up fault injection with Chaos Engineering, but they're not the same thing.
Fault injection is when you add a specific error to your system to see what happens. Its main purpose is usually to test a certain part or function.
Chaos Engineering covers more ground. It's about finding problems you didn't expect and learning how your whole system reacts to stress. Beyond fault injection, it also involves planning experiments, forming hypotheses, and looking at how changes affect the whole system.
In short, fault injection is the technique of introducing specific errors, while Chaos Engineering is the practice of finding out how resilient your whole system is to different types of failures.
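For contrast, this is roughly what plain fault injection can look like in code. It is a small sketch under stated assumptions: `inject_fault`, `fetch_price`, and `price_with_fallback` are hypothetical names invented for this example, not part of any real library. One specific error is forced into one specific dependency to check how a single function reacts.

```python
import functools


def inject_fault(exc: Exception):
    """Hypothetical helper: make every call to the wrapped function raise `exc`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            raise exc
        return wrapper
    return decorator


def fetch_price(item: str) -> float:
    """Stand-in for a call to a remote pricing service."""
    return {"book": 12.5}.get(item, 0.0)


def price_with_fallback(item: str, fetch=fetch_price, default: float = 0.0) -> float:
    """The function under test: fall back to a default when the service times out."""
    try:
        return fetch(item)
    except TimeoutError:
        return default


# Fault injection: force one specific error in one specific dependency.
broken_fetch = inject_fault(TimeoutError("pricing service timed out"))(fetch_price)

print(price_with_fallback("book"))                      # normal path
print(price_with_fallback("book", fetch=broken_fetch))  # injected timeout, fallback is used
```

This answers a narrow question about one function. A chaos experiment would use the same kind of mechanism, but wrapped in a hypothesis about the whole system and measured against its normal behavior.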
The Four Steps of Chaos Experiments
As we’ve seen, Chaos Engineering isn't about randomly breaking things, but follows a careful, scientific approach with four main steps.
Let's look at each one:
Find the normal state. Before you start any experiment, you need to know what normal looks like for your system. First, decide which metrics matter most for your application. These might include how many queries per second your system can handle, how long it takes to respond to requests, error rates, and CPU and memory usage. Then make sure you can track these metrics during the experiment, so you can see what changes when you add chaos.
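As a rough illustration of this first step, the sketch below turns a window of observed requests into a small baseline. The `Request` record, the `steady_state` function, and the sample numbers are all made up for the example, and it covers only throughput, latency, and error rate. CPU and memory usage would normally come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, pct: int):
    """Nearest-rank percentile of a non-empty list (pct is an integer from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def steady_state(requests, window_seconds: float) -> dict:
    """Summarise a window of traffic into a few baseline metrics."""
    latencies = [r.latency_ms for r in requests]
    failures = sum(1 for r in requests if not r.ok)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "p95_latency_ms": percentile(latencies, 95),
        "error_rate": failures / len(requests),
    }


# Made-up sample: 20 requests observed over a 10 second window, one of them failed.
sample = [Request(latency_ms=100 + 5 * i, ok=(i != 7)) for i in range(20)]
baseline = steady_state(sample, window_seconds=10)
print(baseline)
```

Computing the same summary again while the experiment runs gives you two sets of numbers to compare, which is what lets you say whether the failure actually changed anything.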