

Discover more from The Polymathic Engineer
Partitions
A primer on database partitioning. Horizontal and Vertical partitioning plus the most common sharding strategies.
Hi Friends,
Welcome to the 42th edition of the Polymathic Engineer newsletter. This week we will talk about an important topic to scale any data intensive application: database partitioning.
The outline will be as follows:
What is partitioning
Vertical and Horizontal partitioning
Horizontal Partitioning strategies
Advantages and Disadvantages of partitioning
Interesting tweets
Partitioning
Partitioning is the process of splitting the data of a software system into smaller, independent units. Each one of those units is typically called a partition.
We can then assign one or more partitions to a single machine, where this machine will be will be in charge of serving requests for the associated data.
These requests can generally belong in two main categories: read requests that retrieve data and write requests that update data or create new data.
Horizontal vs Vertical Partitioning
There are 2 strategies for partitioning a table: vertical and horizontal.
Considering a table, vertical partitioning moves some columns to other tables. It's straightforward and puts together data related to specific features. The downside is that in case of growth, a further partition is required.
Horizontal partitioning divides a table into new tables with the same columns but fewer rows. Each new table represents a different data store that can be hosted on a different machine.
Horizontal Partitioning strategies
After using horizontal partitioning, each server has to deal with half the load. This introduces a problem, though: how does the application know which shard has the data it is looking for?
A simple way to do it would be to make the application aware of all the servers, so that it can send a query to all of them. These will all return "not found," except for the one with the data.
This works, but it's not the best option because we send queries that aren't needed, which put load on the server. Probably less load than the server that has the data, but it is possible to do better.
There are 3 popular schemas to horizontally partition (or shard) a table:
• hash-based
• range-based
• directory-based
Hash-based applies a hash function to one column to map rows to new tables. To ensure that entries are placed in the correct shards and in a consistent manner, the values entered into the hash function should all come from the same column. This column is known as a shard key.
A good hash function can evenly balance the load across the new tables. But adding or removing a data store means rehashing values and migrating data. During the migration it won’t be possible to write new data and the application could be subject to downtime.
Range-based splits the data using ranges of values in one or more columns. The main advantage is that is simple to implement. The application reads which range the data falls into and writes it to the corresponding partition. Moreover this solution facilitate data scan and range queries. The downside is that it can generate an unbalanced load between tables if ranges aren't chosen carefully.
Directory-based uses a dedicated lookup service that uses a shard key to keep track of which partition holds which data. The main advantage of this solution is flexibility. You can use whatever system or algorithm you want to assign data entries to partitions, and it’s relatively easy to dynamically add partitions using this approach. The downside of this approach is its additional complexity: connecting to the service impacts the performance and the lookup table can become a single point of failure.
Advantages and Disadvantages
The main benefits of partitioning are:
• horizontally scaling a database is easier
• reading and writing in parallel on multiple data stores increase the throughput
• availability increases since if one data store goes down, the others still operate
The main disadvantages of partitioning are:
• the complexity increases and application code is needed to fetch the data
• operations like back up, indexing, and changing schema are more difficult
• joins are inefficient since data has to be compiled from multiple data stores
Another thing to consider is that partitioning also uses network capacity. If you have a huge database in the cloud that need to be frequently being resharded, the bill will go up.
Interesting Tweets
In the end, you need both interview skills and hands-on experience. But projects are the way to go if you're a junior developer and are not interested in big tech.
The important thing is to understand why you do something in a project and consider if there's a better way. Link
I wouldn't say that applying for a job an a company’s portal is a complete waste of time. In Europe, finding a job applying in this way is still quite possible. But networking certainly gives you a big advantage over the other job searchers.
My wife has always clearly wanted to work after having a family. We live far from our parents abroad, so we used child care. But we found a good balance, and we all are happy. That's the most important thing.
I spent most of my career working in non-tech companies and don't regret it. Some things have been really gratifying. For example, traveling with a train moved by a locomotive running your software.