Hashing
What are hash tables and how they are implemented under the hood. Plus an introduction to specialised data storage for distributed systems.
Hi Friends,
Welcome to the 14th issue of the Polymathic Engineer newsletter.
Today we will talk about:
How hash tables are implements
Specialized data storage for distributed systems
Coding challenge
Three interesting tweets
Hash Tables
Hash tables are one of the most used data structures. Here is how they work under the hood.
A hash function is a function that takes any input and converts it to an integer. Inputs are called keys; the output integer is always less than a fixed value. The output is deterministic. The same input is always converted to the same integer.
Using a hash function can convert any key to the index of an array. In this way, the hash function allows mapping keys to values of the data type stored in the array. A hash table is precisely this combination of hash functions and arrays.
However, a hash function doesn't guarantee that different keys map to various array indexes. When two keys map to the same position, there is a collision. Collisions must be handled; otherwise, keys will get overridden, and values will be lost.
There are two main techniques to handle collisions: chaining and open addressing. The idea of chaining is to store in the arrays linked lists instead of values. Each linked list holds the key-value pairs of all the keys mapped to each array's index.
Open addressing doesn't use additional data structures, and it's based on probing. If a collision occurs, an alternative location is searched in the array. The alternative location can be at a fixed interval or computed by another hash function.
Collisions affect the performance of hash tables and need to be minimized. Choosing a prime number as the size of the array is an excellent way to do this. Because of collisions, hash tables' primary operations (insert, delete, lookup) are O(N) in the worst case.
With a good hash function, collisions are less frequent, and those operations are O(1) on average. As a last note, hash tables are unordered data structures. This means that keys do not sort values and the insertion order is not preserved.
Specialized data storages
Relational databases and key-value storages are the most used in distributed systems. Nevertheless, many systems effectively use other types of more specialized storage. Here is an introduction to them.
Time series
Time series are specialized storage for large amounts of data related to a specific time. They are optimized to measure data changes over time and perform statistical computations.
Typical use cases of time series are:
• monitoring system's parts with many simultaneous events
• collect telemetry data in IoT systems with many devices
• dealing with financial data (stock, cryptocurrencies)
Popular implementations are Influx DB and Prometheus.
Blob storages
Blobs are specialized storage for storing and retrieving massive amounts of unstructured data. Blob is indeed an acronym for large binary objects. Images, videos, extensive text data, and compiled code are all examples of such things.
These databases are complex and optimized for availability and durability. From a user perspective, they look like key-value storages since data is accessed through a key.
Google cloud storage, Amazon S3, and Azure blob storage are popular implementations.
Graph databases
Graph databases are specialized storage for storing data with many relationships. Data are stored according to the graph data model with entities and relations between them. Executing queries on these graphs looks like traversing a graph.
Executing the same queries on relational databases would be more complicated and costly. Indeed, the number of joins between tables would be high. The most popular graph database implementation is Neo4j with its query language Cypher.
Spatial databases
Spatial databases are specialized storage for storing spatial data, like locations on a map. They rely on spatial indexes like k-d tree or quadtree and can efficiently perform spatial-related queries.
Quadtrees are the simplest and are conceptually similar to a grid. The outer grid rectangle represents the root and contains all the possible spatial locations.
Each rectangle is recursively divided into 4, and a quadrant represents the children of a node. The division stops in quadrants with less than a specific number of locations.
So geographical areas with many locations are divided into more quadrants. Spatial queries using quadtrees are efficient, requiring a base four logarithmic time. Databases with good geospatial support are PostgreSQL, MongoDB, and Elasticsearch.
Coding Challenge
The problem has as input an integer array and a subarray, and it asks to find the next greater element for each subarray element.
The brute force solution is to search each greater element iterating through the array. The time complexity of this solution is quadratic.
A better solution is iterating through the array inserting its element in a decreasing monotonic stack. Such a stack contains only elements in decreasing order. When a greater element x comes, all the elements lower than x are popped out from the stack, and x is pushed. The pushed element is the greater element for all the popped elements.
A hash map can effectively store the correspondences between each element and the greater to build up the requested result.
The time complexity is O(N) because each element is pushed in and popped out of the stack only once.
The space complexity is O(N) for the stack
Here is a possible implementation.
Interesting Tweets
Asking questions is the correct approach, especially with less experienced developers.
Even if you could give them a solution directly, finding the solution by answering questions is much more valuable in the long term. It is what helps them to grow.

Another good piece of advice is to check on LinkedIn people who previously covered the offered position in that company.
Analyzing their career trajectory can also give good insight into the company.

This tweet exemplifies how viral tweets can be dangerous, especially for people with less experience.
First, seniority has nothing to do with contributions on a public GitHub profile or open source contributions.
Second, most of the code people write is in the private repositories of the companies they worked for.