# Map And Reduce

### How the MapReduce algorithm works. Plus, how to use Hash Sets and Maps in coding interviews.

Hi Friends,

Welcome to the 16th issue of the Polymathic Engineer newsletter. This is a wild number for me.

When I started, I never thought I would stay so consistent and deliver this newsletter weekly for 4 months in a row. Some weeks are more challenging than others because of personal and work commitments, but knowing that many of you will check the newsletter motivates me to keep going.

Thank you, everybody, for your support. That means a lot to me.

Today we will talk about:

- How the MapReduce algorithm works
- Use cases of hash sets and maps in coding interviews
- Coding challenge
- Three interesting tweets

### MapReduce

MapReduce is a popular framework initially designed by Google. It enables processing large distributed datasets efficiently and in a fault-tolerant way.

MapReduce was introduced to run massive computations on terabytes of web-crawled data. The goal was to create an abstraction layer over a distributed environment of thousands of machines.

The framework is based on two functions common in functional languages: Map and Reduce.

*Map* takes as input a list of elements and a function. It applies the function to each element in the list and returns the resulting list. If the list contains the integers [1,2,3,4] and the function squares a number, Map transforms the list into [1,4,9,16].
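For instance, Python's built-in `map` behaves exactly like this (wrapped in `list` because it returns a lazy iterator):

```python
# Apply the squaring function to every element of the list.
squared = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squared)  # [1, 4, 9, 16]
```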

*Reduce* takes as input a list of elements, a function, and a starting element. It combines the starting element with the first element of the list in some way, then combines the result with the next element, and so on until the list is exhausted.
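Python exposes the same primitive as `functools.reduce`; here is a minimal sketch:

```python
from functools import reduce

# Fold the list into a single value, starting from 0 and summing pairs.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(total)  # 10
```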

Suppose the list contains integers, and the function sums up two numbers. If the starting element is 0 and the list contains [1,2,3,4], Reduce generates 10 as output.

The following steps summarize the internal functioning of the MapReduce framework.

First, it looks at the input files and splits them into multiple pieces. Second, it starts copies of the MapReduce program on a cluster of machines. One of the copies will be the master and the rest workers. The master assigns Map and Reduce tasks to the workers.

The user can specify the number of Map and Reduce tasks. Third, a worker assigned to a Map task reads the contents of its part of the input. It parses key/value pairs from the input and passes them to the user-provided Map function.

The output of the Map function is a set of intermediate key/value pairs. These pairs are partitioned, and their storage locations are passed back to the master. The locations of the intermediate partitions are the inputs for the Reduce workers.

A Reduce worker reads the pairs and groups all occurrences of the same key together. Then it passes the key and the corresponding set of values to the user-provided Reduce function. The output of the Reduce function is appended to an output file.

Once all the Map and Reduce tasks are complete, the master returns control to the user. The user can then access the output of the MapReduce execution through the output files, one for each Reduce worker.
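The phases above can be sketched with the classic word-count example. This single-process toy (the function names are mine, not part of any framework API) only mimics the Map, shuffle, and Reduce phases, with no distribution or fault tolerance:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all intermediate values belonging to the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all counts for one key into a single total.
    return key, sum(values)

documents = ["the quick fox", "the lazy dog"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result["the"])  # 2
```

In the real framework, each `map_phase` and `reduce_phase` call would run on a different worker machine, and the shuffle would move data between them.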

If you want to dig deeper, I strongly suggest reading the original MapReduce paper.

### Hash Sets and Maps Use Cases

Hash maps and sets are data structures frequently used in coding interviews.

##### Hash sets: checking the existence

The most common application of hash sets is determining if an element exists.

Checking the existence of an element in an array is a linear time operation. Hash sets make it possible to perform the check in amortized constant time.

Let's imagine you have an algorithm that needs to check the existence of an element M times. With an array, the time complexity of this algorithm would be O(N·M), where N is the length of the array. With a hash set, the time complexity reduces to O(M).

An example could be an algorithm to detect the first duplicated letter in a string S.

The brute force solution is to iterate over S, checking for each character if it appears earlier in the string. Using a hash set to store the characters seen so far reduces the complexity from quadratic to linear.
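A minimal sketch of the hash-set version (the function name is mine, for illustration):

```python
def first_duplicate(s):
    # Track the characters seen so far; membership tests on a set
    # run in amortized O(1), so the whole scan is O(len(s)).
    seen = set()
    for ch in s:
        if ch in seen:
            return ch
        seen.add(ch)
    return None  # no character repeats

print(first_duplicate("engineer"))  # n
```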

##### Hash maps: counting the frequencies

The most common application of hash maps is counting the frequency of the elements. The hash map translates keys to integers, where each integer represents the number of occurrences of the corresponding key.

For example, consider the problem of verifying if all characters in a string have the same frequency.

A hash map is the ideal data structure to count all character frequencies in linear time. A single iteration over the map can then check if all the frequencies are equal.
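A possible linear-time sketch using Python's `collections.Counter`, a hash map specialized for counting (the function name is mine):

```python
from collections import Counter

def all_same_frequency(s):
    # Count each character, then check that only one distinct count exists.
    counts = Counter(s)
    return len(set(counts.values())) == 1

print(all_same_frequency("aabbcc"))   # True
print(all_same_frequency("aabbccc"))  # False
```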

### Coding Challenge

The coding challenge for this week was Online Stock Span.

The problem asks to implement an algorithm that collects a stock's daily price and returns the price span for the current day.

The **span** of the stock's price in one day is the number of past consecutive days for which the stock price was less than or equal to that day's price.

The brute force approach would store the daily stock prices in an array, then iterate backward until a price exceeds the current one. That works, but it is inefficient since it often requires iterating over the same elements. As a consequence, the time complexity would be quadratic.

The key observation to get a better solution is that while iterating backward, we no longer need to care about all the prices that are lower than the current one.

It's easy to understand why with an example. Suppose the prices for the first 4 days are [7, 2, 4, 6]. Under this hypothesis, the span for the fourth day would be 3 because 2 and 4 are less than 6. So, if the price for the fifth day is greater than or equal to 6, its span would be the span of the fourth day plus 1, with no need to look at 2 and 4 again.

This is the perfect use case for a monotonically decreasing stack. The stack stores only decreasing prices, together with their spans. When a new price is processed, all the prices lower than or equal to the current one are popped from the stack, and their spans contribute to the current day's span. Finally, the current price and its span are pushed onto the stack.

Here is a possible implementation of the algorithm.
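A possible Python sketch, following the `StockSpanner` interface from the problem statement:

```python
class StockSpanner:
    def __init__(self):
        # Monotonically decreasing stack of (price, span) pairs.
        self.stack = []

    def next(self, price):
        span = 1
        # Pop every past price lower than or equal to the current one
        # and absorb its span into the current day's span.
        while self.stack and self.stack[-1][0] <= price:
            span += self.stack.pop()[1]
        self.stack.append((price, span))
        return span
```

For example, feeding the prices [100, 80, 60, 70, 60, 75, 85] to `next` one day at a time yields the spans [1, 1, 1, 2, 1, 4, 6].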

The time complexity for calculating the span of the current day is amortized O(1). Indeed, each element can be pushed onto and popped off the stack at most once, and stack operations are O(1). So on average, for N elements, the cost per day is constant. The space complexity is O(N) because of the stack.

The coding challenge for the next week is Asteroid Collision.

### Interesting Tweets

Processor architecture was one of my favorite topics at university, and it proved useful during all the years I worked on embedded systems. Every software engineer should have a basic understanding of it.

Hussien is right here. Another thing to realize is that being successful in interviews does not require exactly the same skills you need to be successful in your job. So it is important to build up those skills too.

When doing a postmortem analysis, it is important to avoid making it seem like a single person is responsible for what happened. Even if a single person caused the failure, it could have been anyone else on the team.