The Polymathic Engineer

The Polymathic Engineer

Share this post

The Polymathic Engineer
The Polymathic Engineer
Indexing - Part 2
Copy link
Facebook
Email
Notes
More

Indexing - Part 2

A deep dive into database indexing. Part two: clustered indexes, secondary indexes, multi-column indexes, full-text indexes. Plus database indexing strategies.

Franco Fernando's avatar
Franco Fernando
Jun 08, 2024
∙ Paid
13

Share this post

The Polymathic Engineer
The Polymathic Engineer
Indexing - Part 2
Copy link
Facebook
Email
Notes
More
Share

Hi Friends

Welcome to the 75th issue of the Polymathic Engineer newsletter.

This week, we will complete our analysis of database indexes that we started three weeks ago. You can read it here in case you missed it.

The outline will be as follows:

  • Clustered vs non clustered indexes

  • Primary vs Secondary Indexes

  • Multi-column indexes

  • Full-text Indexes

  • Indexing Strategies

Clustered vs. Non-Clustered Indexes

In the first part, we discussed the most common data structures used to implement database indexes in terms of keys and values.

The key is what queries look for, but the value can be one of two things: the actual data or pointer to the data stored elsewhere, typically in a heap file.

A heap file is a storage structure where rows are stored in no particular order. It can be append-only or be used to keep track of deleted data to overwrite it with new data later. Both strategies have pros and cons.

Overwriting data is more efficient for updates without key changes, as records can be overwritten in place. However, if the new value is larger than the old one, it must move to a new location, complicating matters.

Moving from the index to the heap file has a performance penalty on reads. To mitigate this, databases allow the creation of clustered indexes that store the indexed data directly within the index itself and not in the heap file.

A clustered index reorders the way data is physically stored on the disk. It does not store data randomly or even in the order they were inserted. Instead, it organizes them to align with the index's order. The specific key used to arrange this order is called the clustered key.

There are a few crucial points to keep in mind about clustered indexes:

  • A table can only have one clustered index because the physical data can be sorted in only one order.

  • Adding a clustered index or changing it can be time-consuming as it requires physically reordering data. For example, PostgreSQL stores data in the order of insertion by default, but the CLUSTER command can be used to reorder it to match a specific index. However, this order is not automatically maintained and must be rerun to keep the order.

  • It's best to pick a unique, sequential clustered key to avoid duplicate entries and minimize page splits when inserting new data. In MySQL, the primary key of a table is always a clustered index and secondary indexes refer to the primary key.

On the other hand, non-clustered indexes are more similar to the index at the back of a book. They maintain a distinct list of keys, each with a pointer indicating the location of the data containing that value.

The most important points to remember about non-clustered indexes are:

  • They are stored separately from the data rows, so the physical order of the data isn’t the same as the logical order established by the index.

  • You can have multiple non-clustered indexes that point to different locations in the heap file. Each index is helpful for various types of queries.

  • Accessing data using a non-clustered index involves at least two disk reads: one to access the index and another to access the data.

Covering indexes are a compromise between clustered and non-clustered indexes. They store some of a table’s columns within the index itself, allowing some queries to be answered using only the index, thus covering the query.

Using a covered index can significantly improve the read performance of some queries, as the index alone can fully satisfy them. However, as with any index, it requires additional storage space and adds overhead on writes due to partial data duplication.

Covering indexes makes more sense when you have tables with many rows and want to run queries that return a small subset of columns.

Primary vs Secondary Indexes

A primary index uniquely identifies one row in a relational table, one document in a document database, or one vertex in a graph database.

The primary index ensures that each record in the database is unique, which is critical for maintaining data integrity. Other records in the database can refer to this record by its primary key (or ID), and the index is used to resolve such references.

The primary index is typically the main means of accessing data. When creating a table, the primary key often doubles as a clustered index, meaning the data in the table is physically sorted on disk based on this key. This ensures quick data retrieval when searching by the primary key.

This post is for paid subscribers

Already a paid subscriber? Sign in
© 2025 Franco Fernando
Privacy ∙ Terms ∙ Collection notice
Start writingGet the app
Substack is the home for great culture

Share

Copy link
Facebook
Email
Notes
More