Indexing - Part 2
A deep dive into database indexing. Part two: clustered indexes, secondary indexes, multi-column indexes, full-text indexes. Plus database indexing strategies.
Hi Friends
Welcome to the 75th issue of the Polymathic Engineer newsletter.
This week, we will complete our analysis of database indexes that we started three weeks ago. You can read it here in case you missed it.
The outline will be as follows:
Clustered vs non clustered indexes
Primary vs Secondary Indexes
Multi-column indexes
Full-text Indexes
Indexing Strategies
Clustered vs. Non-Clustered Indexes
In the first part, we discussed the most common data structures used to implement database indexes in terms of keys and values.
The key is what queries look for, but the value can be one of two things: the actual data or pointer to the data stored elsewhere, typically in a heap file.
A heap file is a storage structure where rows are stored in no particular order. It can be append-only or be used to keep track of deleted data to overwrite it with new data later. Both strategies have pros and cons.
Overwriting data is more efficient for updates without key changes, as records can be overwritten in place. However, if the new value is larger than the old one, it must move to a new location, complicating matters.
Moving from the index to the heap file has a performance penalty on reads. To mitigate this, databases allow the creation of clustered indexes that store the indexed data directly within the index itself and not in the heap file.
A clustered index reorders the way data is physically stored on the disk. It does not store data randomly or even in the order they were inserted. Instead, it organizes them to align with the index's order. The specific key used to arrange this order is called the clustered key.
There are a few crucial points to keep in mind about clustered indexes:
A table can only have one clustered index because the physical data can be sorted in only one order.
Adding a clustered index or changing it can be time-consuming as it requires physically reordering data. For example, PostgreSQL stores data in the order of insertion by default, but the
CLUSTER
command can be used to reorder it to match a specific index. However, this order is not automatically maintained and must be rerun to keep the order.It's best to pick a unique, sequential clustered key to avoid duplicate entries and minimize page splits when inserting new data. In MySQL, the primary key of a table is always a clustered index and secondary indexes refer to the primary key.
On the other hand, non-clustered indexes are more similar to the index at the back of a book. They maintain a distinct list of keys, each with a pointer indicating the location of the data containing that value.
The most important points to remember about non-clustered indexes are:
They are stored separately from the data rows, so the physical order of the data isn’t the same as the logical order established by the index.
You can have multiple non-clustered indexes that point to different locations in the heap file. Each index is helpful for various types of queries.
Accessing data using a non-clustered index involves at least two disk reads: one to access the index and another to access the data.
Covering indexes are a compromise between clustered and non-clustered indexes. They store some of a table’s columns within the index itself, allowing some queries to be answered using only the index, thus covering the query.
Using a covered index can significantly improve the read performance of some queries, as the index alone can fully satisfy them. However, as with any index, it requires additional storage space and adds overhead on writes due to partial data duplication.
Covering indexes makes more sense when you have tables with many rows and want to run queries that return a small subset of columns.
Primary vs Secondary Indexes
A primary index uniquely identifies one row in a relational table, one document in a document database, or one vertex in a graph database.
The primary index ensures that each record in the database is unique, which is critical for maintaining data integrity. Other records in the database can refer to this record by its primary key (or ID), and the index is used to resolve such references.
The primary index is typically the main means of accessing data. When creating a table, the primary key often doubles as a clustered index, meaning the data in the table is physically sorted on disk based on this key. This ensures quick data retrieval when searching by the primary key.