Spark
What is Apache Spark, and how its architecture works. Plus, an overview of mock interview services.
Hi Friends,
Welcome to the 18th issue of the Polymathic Engineer newsletter.
I had a very tough week with the flu and a cough, but I finally feel better and fully recharged. In the meantime, I was able to join a "Portfolio of Small Bets" cohort.
Maybe I won't start my own business, but learning about things like reducing the risk of failure or using randomness to your advantage has been extremely interesting. I'm looking forward to the next session of the cohort this week.
Today we will talk about:
What Apache Spark is and its high-level architecture
Mock interview services
Coding challenge
Three interesting tweets
Apache Spark
Apache Spark is an open-source framework for computations on large, distributed data. It was introduced to overcome MapReduce's limitations, and its popularity exploded.
The foundation of Spark is a data structure called the Resilient Distributed Dataset (RDD). An RDD is an abstraction representing a read-only collection of objects partitioned across a computing cluster. RDDs are stored in memory and manipulated through transformations and actions.
RDDs support all the standard MapReduce functions, plus operations like joining, filtering, and aggregation. Spark takes care of distributing these operations across multiple nodes for parallel processing. RDD processing is carried out by a driver and multiple executors.
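You don't need a Spark cluster to get a feel for the split between transformations and actions: Java's Stream API has the same lazy shape, where intermediate operations only describe a pipeline and a terminal operation triggers execution. Here is a small analogy sketch in plain Java (not Spark itself):

```java
import java.util.List;
import java.util.stream.Stream;

public class LazyPipeline {
    public static void main(String[] args) {
        // Like RDD transformations, intermediate stream operations are lazy:
        // nothing runs until a terminal operation (the "action") is invoked.
        Stream<String> words = Stream.of("spark", "rdd", "driver", "executor")
                .filter(w -> w.length() > 4)   // transformation: nothing happens yet
                .map(String::toUpperCase);     // transformation: still nothing

        // The terminal operation triggers the whole pipeline,
        // like calling collect() on an RDD.
        List<String> result = words.toList();
        System.out.println(result); // [SPARK, DRIVER, EXECUTOR]
    }
}
```

The difference, of course, is that Spark runs each stage of the pipeline in parallel across executors, while a plain stream runs on one machine.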
A Spark program starts with the driver creating a Spark context. The context is an orchestrator that analyzes the code and determines the tasks to be performed. It uses a cluster manager to coordinate the executors and schedule the tasks.
The scheduler tracks all the data transformations using a Directed Acyclic Graph (DAG), which enables lazy evaluation. First, the scheduler divides the DAG into stages. Then it runs various optimizations to remove any redundant computation.
The cluster manager also launches the executors dynamically. Executors are nothing more than Java applications residing on worker nodes. They run tasks and return the results to the driver.
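In practice, driver and executor resources are usually set when launching the application with spark-submit. The class name, jar, and resource values below are placeholders, just to show the shape of the command:

```shell
# Example spark-submit invocation (class name, jar, and resource
# values are placeholders, not recommendations).
#   --num-executors:   executor JVMs to launch (YARN mode)
#   --executor-memory: heap per executor
#   --executor-cores:  parallel task slots per executor
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --driver-memory 2g \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  myapp.jar
```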
On top of this core engine, Spark offers other modules like:
SQL for working with structured data
MLlib to provide a distributed machine learning framework
Structured Streaming to process real-time streaming data from Kafka or Kinesis
GraphX to manipulate graphs and run algorithms for traversal, connections, and so on
All those libraries give Spark the flexibility to handle a diverse range of workloads. This flexibility, together with fast in-memory RDD processing, makes Spark shine over competitors.
As a data processing framework, Spark runs on top of a distributed data storage layer. The Hadoop Distributed File System (HDFS) is a common choice, but other storage layers such as MongoDB, HBase, Cassandra, or Amazon S3 work as well.
Mock Interviews
Mock interviews are a great way to prepare for technical interviews.
Here are 5 mock interview services I've tried, with their pros and cons. Keep in mind that I have no affiliation with these websites; I'm only sharing my personal experience using them.
Pramp
A popular free service where you can practice any kind of technical interview. You alternate between the interviewee and interviewer roles in each session. The downside is that sometimes the service doesn't match you with the right person, and you end up wasting time.
Expert Mitra
The best value for money of any service I've tried. The interviewers are very professional and provide good feedback. They helped me understand my strengths and weaknesses in design, coding, and behavioral interviews. I highly recommend it.
Gainlo
I tried this only once for coding interviews, and the overall experience was positive. I got short feedback immediately at the end of the interview and a detailed one by email later on. They have experienced interviewers, but the service is costly.
CareerCup
This service allows you to contact interviewers directly and schedule an appointment. The selection of interviewers is minimal, but they are all superb. I used it mainly for system design and learned a lot. The only drawback is the price.
TechMockInterview
I had a great experience doing mock coding and system design interviews with them. People are generally very professional and provide good feedback on how to improve. Interestingly, they also offer a mentorship program, though it is still in beta.
A reader also reported good experiences using interviewing.io. I’ve never tried it personally, but the pricing looks competitive.
Coding Challenge
The coding challenge for this week was Removing Stars From a String. The problem takes as input a string s containing stars (*). The goal is, for each star, to remove the closest non-star character to its left, along with the star itself.
The best way to approach the problem is to build the output string manually. We iterate over the input string, processing one character at a time. If the character is not a star, we append it to the output. If it is a star, we remove the last appended character and discard the star.
This is LIFO behavior, which we can get with an array: we could explicitly use a stack data structure, or simulate the stack while building the output string.
In the following code snippet, I did this using a StringBuilder:
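A minimal sketch of that approach (class and method names are illustrative):

```java
public class RemoveStars {
    // Builds the output directly: non-star characters are appended,
    // and each star deletes the last appended character (LIFO behavior).
    public static String removeStars(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '*') {
                sb.deleteCharAt(sb.length() - 1); // pop the closest char to the left
            } else {
                sb.append(c);                     // push the character
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeStars("leet**cod*e")); // lecoe
    }
}
```

Appending to and truncating a StringBuilder avoids allocating a separate stack, since the builder itself plays that role.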
The time complexity of the solution is O(N) because we iterate over the whole input. The space complexity is O(N) for the output builder, or O(1) auxiliary space if we don't count the output itself.
The coding challenge for next week is Permutation in String.
Interesting Tweets
Raul is right here. Unit tests shouldn't care about a class's inner workings. Coupling tests to implementation details misses the point of a unit test and makes the class harder to refactor, since the test has to change whenever the implementation does.

Here in Switzerland there are controversial opinions about what happened to Credit Suisse, but this tweet struck me for another reason.
I remember people making the opposite argument when I left my university career to work professionally as a software engineer. But the truth is that nowadays no job is secure.
So the best strategy is to bet on yourself, differentiate your skills, and keep them sharp.

I also think that Computer Science courses should be more practice-oriented, but they're not obsolete or irrelevant.
They are still the most reliable path to becoming a software engineer and building a foundation of solid skills.