

Discover more from The Polymathic Engineer
Scaling
How PayPal scaled Kafka to handle 1+ Trillion of daily events. Plus using geoDNS and edge cache to build systems reaching a global audience.
Hi Friends,
Welcome to the 41th edition of the Polymathic Engineer newsletter.
Last week I read a very interesting blog post on the PayPal engineering blog and it would be great to discuss about it with you. In addition we will see how a possible way to scale a website so that it can support a global audience like Amazon or Uber.
The outline will be as follows:
A brief intro to Kafka
Kafka Cluster at PayPal
How PayPal scaled their Kafka system
How to build a website at global scale
A brief intro to Kafka
At high level, a Kafka system is a black box used by two entities: producers, and consumers.
Producers are backend services that publish events to Kafka, such as payment services, or customer support services. Events have a timestamp, value, optional metadata headers, and typically have a small size.
Kafka store these events into topics. They are conceptually similar to folders in a filesystem, and can be split onto multiple partitions for horizontal scaling across multiple machines. When a producer sends an event, it must specify the topic and optionally include the specific topic partition.
Consumers subscribe to messages from Kafka topics, but each partition can only send data to a single consumer, who is considered the owner of the partition. To have multiple consumers consuming from the same partition, they need to be in different consumer groups.
Consumers use a pull pattern, polling the server for new messages. Kafka maintains offsets to track which messages each consumer has read, and consumers can change their offset to replay previous messages.
Kafka clusters at PayPal
Internally Kafka is organized into multiple servers called brokers. PayPal runs more than 1,500 brokers worldwide and hosts over 20,000 topics.
Brokers receive messages from producers, store them, and then send them to consumers. Producers and consumers use custom libraries written in different programming languages to communicate with the brokers.
PayPal also uses ZooKeeper servers to track configuration data and help brokers stay in sync. PayPal still prefers ZooKeeper over KRaft, even though keeping these servers running is harder.
How PayPal scaled their Kafka System
These are the main strategies they adopted to make all this possible:
1. Cluster Management: Instead of having fixed addresses for their brokers, PayPal created a service that dynamically handles these address
2. Access Control Lists (ACLs): PayPal set up a system to ensure only authorized applications could access specific Kafka clusters and topics, enhancing security.
3. Monitoring and Alerting: PayPal gets a lot of data on how well their Kafka clusters work to ensure they are reliable. An alert is sent out if something strange happens, making it easier for the team to deal with the problem immediately.
4. QA Environment: PayPal built a separate replica world (QA environment) where developers can freely test things without affecting the real-world setup.
5. Topic Onboarding: Teams must make requests to start new topics. Before approving these requests, PayPal has a Kafka team that looks at them, makes sure they can handle them and sets up any security steps that are needed.
You can read more details in the original post on the PayPal engineering blog.
Scaling a website to support a global audience
A single datacenter is not enough to serve a global audience. Having multiple data centers is a must.
This configuration has many benefits:
1. it reduces the load on the single servers
2. it decreases the network latencies
3. it makes it possible to face outages
Anyway, supporting multiple data centers is not simple and requires specific solutions. There are 2 main components that can be added to the infrastructure to support this scenario: geoDNS and edgeCache servers.
geoDNS
A regular DNS server resolves a domain name to an IP address, like 207.193.49.52. A geoDNS behaves the same way from the client's point of view. The difference is that it serves IP addresses based on the client's location.
Clients connecting from the US get a different IP address than clients from Asia. So, clients from both the US and Asia can access servers closer to their location. The process of redirecting the clients to the closest data center minimizes the overall network latency.
Edge Cache
An HTTP cache server is a reverse proxy caching the HTTP traffic. Edge Caches are HTTP caches located around the world to be closer to the clients. When the client requests a web page, the request is received by the edge cache.
The server then can then decide between three options:
- serving the page from the cache
- assemble the page sending background requests to the origin server
- mark the page as not cacheable and delegate the request handling to the origin server
Interesting Tweets
AI can't yet replace the creativity and critical thinking skills required to solve a complex software development challenge. But it provides a big help in automating routine tasks, leaving you more time to focus on the core problems.
It's hard to stay motivated if you always feel that you're not good at something. But sometimes, we are too strict with ourselves, and we don't notice the little progress we make. Celebrating small wins helps.
Writing meeting notes and sharing them with the participants and other stakeholders is almost a must for me. It doesn't eliminate discussions and misunderstandings, but at least it reduces them.