Push and Pull
Push and pull architectures in distributed systems. Plus, how Twitch handles huge video streaming traffic.
Hi Friends,
Welcome to the 20th issue of the Polymathic Engineer newsletter. I took a week off from the newsletter to fully enjoy the Easter vacation with my family. We spent a few days with friends in Konstanz, a beautiful town on the Bodensee in Germany. We dedicated the remaining days to sports and kids' activities. The last period at work has been quite challenging, and I needed this recharging break.
Today we will talk about:
Video stream processing at Twitch
Push vs Pull architectures
Coding challenge
Interesting tweets
Video Streaming at Twitch
Twitch is a streaming platform where content creators can stream live videos to their audience. More than 10 million daily active users stream videos using different media.
To process such a vast number of video streams, Twitch maintains approximately a hundred servers in different geographic regions. These servers, distributed worldwide, are called Points of Presence (PoPs). Both streamers and viewers connect to PoPs to upload and download videos.
PoPs are connected through a private backbone network dedicated to transmitting their content. This ensures that Twitch traffic is not affected by the instability of the public Internet. Between the PoPs there are geographically distributed origin data centers.
The origin data centers handle video processing. For example, they are responsible for transcoding a livestream into different bitrates or converting it to different video formats.
Typically, a video follows this path inside the backbone network:
- from a streamer’s device to a PoP
- from the PoP to an origin data center
- from the origin data center to all the PoPs that are close to the stream’s viewers
A proprietary ingest routing system handles all the traffic between PoPs and origin data centers. Its architecture consists of two components: a Media Proxy and a Routing Service. The Proxy is a data plane component that runs in each PoP.
The only functionality of the Media Proxy is sending the video streams to various origin data centers. The Routing Service is a control plane component that runs in AWS. Its functionality is to tell the Proxies which origin data center to send the videos to.
The Routing Service monitors Twitch's infrastructure in real time and runs a randomized greedy algorithm. The algorithm uses the information about the infrastructure to optimize routing decisions. Its goal is to minimize viewers' latency and maximize the origins' resource usage.
The Routing Service relies on two other services implemented in AWS: Capacitor and Well. Capacitor monitors the computing resources in each origin, tracking fluctuations due to maintenance or failures. Well monitors the backbone network and provides information about its status.
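To make the idea of a randomized greedy choice more concrete, here is a toy sketch. It is not Twitch's actual algorithm, whose details are not described here. The Origin fields, the choose_origin function, and the top_k parameter are all hypothetical: free_capacity stands in for the kind of data Capacitor could report, and latency_ms for the kind of data Well could report.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Origin:
    name: str
    free_capacity: int  # hypothetical: streams the origin can still accept
    latency_ms: float   # hypothetical: backbone latency from the PoP


def choose_origin(origins: list[Origin], top_k: int = 2,
                  rng: Optional[random.Random] = None) -> Origin:
    """Pick an origin for a new stream with a randomized greedy rule."""
    rng = rng or random.Random()
    # Only origins that can still accept a stream are candidates.
    candidates = [o for o in origins if o.free_capacity > 0]
    if not candidates:
        raise RuntimeError("no origin has free capacity")
    # Greedy part: rank candidates by latency and keep the best few.
    best = sorted(candidates, key=lambda o: o.latency_ms)[:top_k]
    # Randomized part: choose among the best few so that new streams
    # do not all pile onto the single lowest-latency origin.
    chosen = rng.choice(best)
    chosen.free_capacity -= 1
    return chosen
```

With top_k=1 the rule degenerates into a plain greedy choice. Larger values trade a bit of latency for a more even spread of load.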
For more details about how Twitch built this infrastructure, you can read the full blog post here.
Push vs Pull architectures
Many scenarios in distributed systems have producers creating data and consumers needing data.
In a push architecture, producers send data to consumers. In a pull architecture, consumers request data from producers. Depending on the needs, one of the two architectures can be preferable.
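The difference between the two models comes down to who initiates the transfer. Here is a minimal in-process sketch, where the class names PushProducer and PullProducer are made up for illustration. A real system would put a network between the two sides.

```python
from collections import deque
from typing import Callable


class PushProducer:
    """Delivers each item to its subscribers as soon as it is published."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, item: str) -> None:
        # The producer initiates the transfer.
        for callback in self._subscribers:
            callback(item)


class PullProducer:
    """Buffers items until a consumer asks for them."""

    def __init__(self) -> None:
        self._buffer: deque[str] = deque()

    def publish(self, item: str) -> None:
        self._buffer.append(item)

    def fetch(self, max_items: int) -> list[str]:
        # The consumer initiates the transfer and decides how much to take.
        count = min(max_items, len(self._buffer))
        return [self._buffer.popleft() for _ in range(count)]
```

In the push case a consumer registers a callback once and then just receives data. In the pull case it has to call fetch every time it wants more.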
Here is a list of advantages and disadvantages according to different parameters.
Latency. With a push model, producers send data as soon as it becomes available. With a pull model, consumers need to request data with round-trip communication. So a push architecture can reduce the latency in the system.
Fault tolerance. With a pull model, if a consumer fails to retrieve data, it can figure out if producers are available and retry the request. With a push model, a consumer may not be aware of producers' failures. So a pull architecture can increase fault tolerance.
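As an illustration of the retry idea, here is a small sketch of a pulling consumer that retries with exponential backoff. The function pull_with_retry and its parameters are hypothetical, and it assumes the fetch callable signals an unavailable producer by raising ConnectionError.

```python
import time
from typing import Callable


def pull_with_retry(fetch: Callable[[], list],
                    attempts: int = 3,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> list:
    """Call fetch, retrying with exponential backoff on ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            # The consumer sees the failure directly and decides what to do.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The consumer learns about the failure because its own request failed. In a push model, silence from the producer looks the same as having no new data.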
Consumption rate. With a push model, consumers don't control the consumption rate and can get overwhelmed. With a pull model, consumers can request data at a rate they can process. So a pull model makes it easier to have consumers with different processing power.
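One way to picture this is an append-only log where each consumer keeps its own read position. It is a simplified sketch: the Log and Consumer classes are hypothetical and only loosely inspired by how log-based message brokers work.

```python
class Log:
    """Append-only list of records shared by all consumers."""

    def __init__(self) -> None:
        self._records: list[int] = []

    def append(self, record: int) -> None:
        self._records.append(record)

    def read(self, offset: int, max_records: int) -> list[int]:
        return self._records[offset:offset + max_records]


class Consumer:
    """Pulls from the log at its own pace, tracking its own offset."""

    def __init__(self, log: Log, batch_size: int) -> None:
        self.log = log
        self.batch_size = batch_size  # how much this consumer can handle per poll
        self.offset = 0

    def poll(self) -> list[int]:
        batch = self.log.read(self.offset, self.batch_size)
        self.offset += len(batch)
        return batch
```

A fast consumer with a large batch_size and a slow one with a small batch_size can read the same log. The slow one simply falls behind instead of being flooded.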
Efficiency. A naive pull-based system can lead to busy waiting when no data is available. Push-based systems don't have such a potential waste of resources. However, pull-based systems are more suitable for batch processing, optimizing the throughput.
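A common way to avoid busy waiting in a pull-based consumer is to block until data arrives and then drain a whole batch. Here is a sketch using Python's standard queue module. The pull_batch function and its parameters are hypothetical names for illustration.

```python
import queue


def pull_batch(source: queue.Queue, max_items: int, timeout: float) -> list:
    """Wait for at least one item, then drain up to max_items without waiting."""
    batch: list = []
    try:
        # Blocks (sleeping, not spinning) until an item arrives or the timeout expires.
        batch.append(source.get(timeout=timeout))
    except queue.Empty:
        return batch
    while len(batch) < max_items:
        try:
            # Take whatever is already available to build a larger batch.
            batch.append(source.get_nowait())
        except queue.Empty:
            break
    return batch
```

The blocking get removes the wasted polling, and the drain loop lets the consumer process several items per round trip.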
Neither the pull nor the push model is intrinsically better. It all depends on the specific scenario and use cases. Indeed, in many cases systems support both or use them in combination.
Coding Challenge
The coding challenge for this week is Validate Stack Sequences.
The problem takes two integer arrays with distinct values as input, representing a possible sequence of push and pop operations on a stack. The goal is to check if this sequence is valid.
This is a problem where simulating the process with a brute-force solution is a good idea. There are two critical observations for that:
- you need to push the elements of the push sequence onto the stack in order
- you can remove an element of the pop sequence only after it has been pushed
So the algorithm iterates over the elements of the push sequence. At each step, it pushes the current element onto the stack and then removes as many elements of the pop sequence from the stack as possible.
Here is a possible implementation of the algorithm:
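A minimal Python sketch of the simulation described above. The function and variable names are illustrative assumptions.

```python
def validate_stack_sequences(pushed: list[int], popped: list[int]) -> bool:
    """Return True if popped can result from push/pop operations on pushed."""
    stack: list[int] = []
    next_pop = 0  # index of the next element we expect to pop
    for value in pushed:
        stack.append(value)
        # Pop greedily while the top of the stack matches the pop sequence.
        while stack and next_pop < len(popped) and stack[-1] == popped[next_pop]:
            stack.pop()
            next_pop += 1
    # Valid only if everything pushed was popped and the pop sequence was consumed.
    return not stack and next_pop == len(popped)
```

For example, with pushed = [1, 2, 3, 4, 5], the pop sequence [4, 5, 3, 2, 1] is valid, while [4, 3, 5, 1, 2] is not because 1 cannot be popped before 2.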
The time complexity is O(N), where N is the length of the sequences. The space complexity is O(N) because of the stack.
The coding challenge for the next week is Maximum Difference Between Node and Ancestor.
Interesting Tweets
Twitter is strongly censoring links to Substack and limiting copy-pasting of tweets into a Substack post. So in this newsletter edition you will find screenshots of tweets without the link. This is silly because those links were a Twitter traffic source.
This is great advice for two reasons. First, what worked for others might not work for us because circumstances and skills are different. Second, people often share roadmaps on social media that they never followed.
I get why Twitter blocked outgoing traffic to Substack but don't understand why they'd block Twitter embeds. It seems like free traffic for Twitter.