Hi Friends,
Welcome to the 72nd issue of the Polymathic Engineer newsletter.
This week, we discuss one of the most important topics concerning database usage: indexing. Understanding indexes is not just academic; it's a practical skill that significantly improves the performance of database systems.
The topic is quite huge, so I split it into two issues of our newsletter. The outline of the first part is as follows:
What is a Database Index?
How do Database Indexes work?
Hash indexes
B Trees and B+ Trees
LSM Trees
B Trees vs LSM Trees
What is a Database Index?
Understanding database indexes can be challenging, so let's start simple and grow from there.
Imagine you have a vast book and you want to find any information. Without the index at the back, you would have to flip through every page until you find your topic of interest. This is not only time-consuming but also not practical.
A database index works on a similar principle but is designed to be much more efficient and necessary for dealing with large volumes of data.
Database indexes are special data structures that improve the speed of data retrieval operations on a database table at the cost of additional writes and storage space.
Much like the index in a book directs you to the correct page, a database index directs the database engine to the disk location where the desired data resides.
Without an index, the database must perform a "full table scan," which involves looking at every row in a table to find the relevant ones. With an index, it can skip large portions of the data and reduce the number of disk accesses.
This capability is crucial as databases grow and data retrieval becomes more complex and slower.
However, indexes come with their trade-offs: they require additional disk space and have a maintenance cost. The index must also be updated whenever data is added, deleted, or updated.
While this introduces some overhead for write operations, the time saved during read operations often justifies the trade-off.
How do Database Indexes work?
Indexes provide a structured and ordered representation of data, independent of where it is stored on the disk.
This logical ordering is critical because physically moving data to place new entries would be prohibitively slow and inefficient.
There are many different data structures that can be used to implement indexes: hash tables, B Trees, LMS Trees, and so on. In the next sections, we will examine the most commonly used ones and their trade-offs.
Before starting, there is one thing you should keep in mind. In the following sections, we will use the generic word key, but this word corresponds to different things depending on which kind of database you are considering.
For example, in relational databases, the keys correspond to the values of a column to index; in a document database, they correspond to documents; and in a graph database, they correspond to vertices.
Hash Indexes
Hash indexes are the most straightforward indexing strategy. Like hash tables in programming, they manage data as key-value pairs.
They use a hash function to map keys to the location of the data on the disk, enabling speedy data retrieval.
When new data is written on the disk, the hash table is updated to reflect its position. The process is the same for inserting new keys and updating existing ones.
When a key is looked up, the index uses the hash function to determine where to retrieve the associated data from the disk.
This index type is well-suited for scenarios where data access patterns predominantly involve equality comparisons. It allows quick data writing and reading and requires minimal disk I/O if the filesystem already caches the data.
However hash indexes have also limitations: