Hi Friends,
Welcome to the 85th issue of the Polymathic Engineer newsletter. This week, we discuss a system design case study: how Booking implements Observability.
Booking.com is a platform where people plan their trips, searching for hotels, resorts, flights, and more.
The platform was founded in 1996 and has grown exponentially, now serving customers all over the world. The company's backend is based on a service-oriented architecture and runs on a hybrid cloud setup.
Observability is critical in this complex scenario for scaling the system, fixing issues, and recovering from failures.
In this issue, we will look at the standard way to implement an observability system and see how Booking uses a different approach.
The outline will be as follows:
The three pillars of observability
What are events and why Booking uses them
How Booking implemented events
The three pillars of observability
The goal of an observability system is to provide a complete picture of a backend's health and performance. We usually get this visibility through three data outputs: logs, metrics, and traces.
Software engineers commonly refer to them as the “three pillars of observability.”
Logs are records of system events. They have a timestamp and capture various levels of detail, from general information to critical error messages. Most of the time, logs are kept in a database like Elasticsearch in raw or JSON format.
Different sources generate logs and give us a qualitative view of the backend's behavior. The code running on the server generates the application logs; the operating system or hardware devices produce system logs; routers, load balancers, firewalls, etc., create network logs.
Logs are essential for troubleshooting, but they are noisy and do not give a high-level overview of the state of your system.
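As a rough illustration (not Booking's actual setup), here is how an application might emit logs as JSON so a log shipper can index them in a store like Elasticsearch. The formatter, logger name, and messages are made up for the example.

import json
import logging
import sys
import time

# Hypothetical formatter: serializes each log record as one JSON object per line,
# which a log shipper can forward to a store such as Elasticsearch.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("service-a")   # example service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("search request handled")
logger.error("upstream timeout while fetching availability")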
Metrics are numerical data points that provide quantitative information about the backend's performance over time. They typically include CPU and memory usage, request and error rates, and more.
Metrics are commonly stored in a time series database and help identify trends and anomalies through statistical models and predictions.
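To make the idea concrete, here is a minimal sketch of pushing data points to a time-series database using Graphite's plaintext protocol, which expects one "path value timestamp" line per sample. The host and metric names are placeholders, not anything from Booking's setup.

import socket
import time

GRAPHITE_HOST = "graphite.example.com"  # placeholder host
GRAPHITE_PORT = 2003                    # default Graphite plaintext port

def send_metric(path, value, timestamp=None):
    # Graphite's plaintext protocol: "metric.path value timestamp\n"
    timestamp = int(timestamp or time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=2) as sock:
        sock.sendall(line.encode("utf-8"))

# Example samples: request count and p99 latency for a hypothetical service.
send_metric("service_a.requests.count", 1)
send_metric("service_a.latency.p99_ms", 183)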
Traces provide a detailed view of how a single request flows through multiple services in the backend. They are critical for finding performance problems and knowing how services depend on each other.
A way to set up traces is to assign each request a unique ID and track it at specific points in the backend (e.g., a call to another service, a database query, a cache lookup).
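A minimal sketch of that idea, assuming an HTTP service that reuses a request ID from an incoming header (or generates one) and propagates it on outgoing calls. The header name and downstream URL are just examples; real systems often use the W3C traceparent header instead.

import uuid
import requests  # assumes the requests library is available

TRACE_HEADER = "X-Request-Id"  # example header name

def handle_request(incoming_headers):
    # Reuse the caller's request ID if present, otherwise start a new trace.
    request_id = incoming_headers.get(TRACE_HEADER, str(uuid.uuid4()))

    # Propagate the same ID on every downstream call so the spans can be joined later.
    requests.get(
        "https://availability.internal/search",  # placeholder downstream service
        headers={TRACE_HEADER: request_id},
        timeout=1,
    )
    return request_id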
What are events, and why does Booking use them
Instead of relying on logs, metrics, and traces, Booking built its observability system on events.
An event is a key-value data structure containing the information generated by a single “unit of work,” such as an HTTP request, a cron job, or a background task.
For example, an event can include the duration and latency of a request, warnings or errors generated, and which services are processing the request.
{
  "availability_zone": "london",
  "created_epoch": "1660548925.3674",
  "service_name": "service A",
  "git_commit_sha": "..",
  …
}
Events are more structured than logs or metrics and offer some key benefits.
First, events have full context about a given unit of work, such as the runtime environment, performance data, and more. This information can be used to generate the classical observability pillars and allows analytics queries to run directly over the event data.
Second, events show which errors affect users, which services were involved, which feature flags were active, and more. They track state across many parts of the backend and give a fuller picture.
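Here is a small sketch of why this structure helps, assuming events are plain dictionaries shaped like the example above: the same record can be turned into a log line, aggregated into a metric, or queried directly. The field names are illustrative, not Booking's schema.

import json
from collections import Counter

# A couple of illustrative events, shaped like the example above.
events = [
    {"service_name": "service A", "created_epoch": 1660548925.36, "duration_ms": 120, "error": None},
    {"service_name": "service A", "created_epoch": 1660548926.10, "duration_ms": 480, "error": "upstream timeout"},
]

# Derive a log line from each failed event.
for e in events:
    if e["error"]:
        print(json.dumps({"level": "ERROR", "service": e["service_name"], "message": e["error"]}))

# Derive a metric from the same events: error count per service.
errors_per_service = Counter(e["service_name"] for e in events if e["error"])

# Run an analytics-style query directly over the events: find slow requests.
slow_requests = [e for e in events if e["duration_ms"] > 300]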
How Booking implemented events
Booking has its own events library that sends events to an event-proxy daemon, which runs on all the machines.
Among other things, the daemon adds information to events, sends events to different Kafka topics, and splits Kafka messages.
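As a rough sketch of that flow (not Booking's actual daemon), here is how a local proxy might enrich an event with host metadata and route it to a Kafka topic using the kafka-python client. The broker address, topic, and field names are assumptions.

import json
import socket
import time
from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers=["kafka.example.com:9092"],  # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def forward_event(event, topic="events.web"):
    # Enrich the event with information only the local daemon knows.
    event.setdefault("hostname", socket.gethostname())
    event.setdefault("forwarded_epoch", time.time())
    producer.send(topic, value=event)

forward_event({"service_name": "service A", "duration_ms": 120})
producer.flush()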
Several consumers read this data from Kafka:
the distributed tracing consumer turns the data into traces using Honeycomb.
the APM generator generates application performance monitoring metrics (e.g., number of requests, latencies) and stores them in Graphite.
the failed event processor looks for events with error messages and writes them to Elasticsearch so that engineers can use them for debugging.
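A minimal sketch of what such a consumer might look like, again with kafka-python and the Elasticsearch REST API via requests. The topic, index, endpoint, and field names are placeholders, not Booking's actual ones.

import json
import requests
from kafka import KafkaConsumer  # assumes the kafka-python package

ES_URL = "http://elasticsearch.example.com:9200"  # placeholder Elasticsearch endpoint

consumer = KafkaConsumer(
    "events.web",                                  # placeholder topic
    bootstrap_servers=["kafka.example.com:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Keep only events that carry an error message.
    if event.get("error"):
        # Index the failed event so engineers can search it later.
        requests.post(f"{ES_URL}/failed-events/_doc", json=event, timeout=2)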
Food for thought
If you keep working on a skill every day, you will have no choice but to become an expert at that skill. The days add up quickly.
As an IC, I'd like to have skip-level 1:1 meetings more often. You can learn a lot from them about other projects and the whole organization.
Interesting Reads
Some interesting articles I read this week: