Database sharding case study: PostgreSQL
Lessons learned from how Notion's team sharded their monolithic Postgres database.
Hi Friends,
Welcome to the 108th issue of the Polymathic Engineer newsletter.
A few weeks ago, we discussed in depth what database sharding is and saw how hard it can be to scale a database horizontally. Today, we will go through a real-world case study to see the challenges Notion faced in sharding their Postgres database and how they overcame them.
The outline will be as follows:
The Database Scaling Challenge
Sharding Strategy and Design Decisions
Implementation and Migration Process
Lessons Learned
Project-based learning is the best way to develop technical skills. CodeCrafters is an excellent platform for practicing exciting projects, such as building your own version of Redis, Kafka, an HTTP server, SQLite, or Git from scratch.
Sign up, and become a better software engineer.
Notion's Database Scaling Challenge
Notion is a popular productivity and collaboration platform that serves as an all-in-one workspace. It's super powerful, and you can use it for taking notes, building spreadsheets, managing calendars, tracking timelines, and much more.
One critical challenge Notion's engineers had to deal with came in mid-2020 and concerned their monolithic PostgreSQL database. The company had relied on it for five years, and it had supported Notion's growth through several orders of magnitude.
However, as Notion's user base kept expanding rapidly, the database struggled with the increasing load. Several indicators pointed to the growing strain on the database, like frequent CPU spikes and a high volume of on-call requests.
The main problem was that PostgreSQL's VACUUM process, which reclaims disk space from dead tuples, began to stall consistently. As a result, the database could no longer manage its storage efficiently, and performance dropped.
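Notion's write-up doesn't show their monitoring, but to get a sense of what "VACUUM falling behind" looks like, here is a minimal Python sketch that queries Postgres's built-in pg_stat_user_tables view for the tables with the most dead tuples. The connection string is a hypothetical placeholder:

```python
# A minimal sketch of spotting tables where autovacuum may be falling
# behind, based on dead-tuple counts from pg_stat_user_tables.
import psycopg2

conn = psycopg2.connect("dbname=example")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """)
    for relname, live, dead, last_autovacuum in cur.fetchall():
        # A high dead/live ratio means VACUUM isn't reclaiming space.
        ratio = dead / max(live, 1)
        print(f"{relname}: {dead} dead tuples (ratio {ratio:.2f}), "
              f"last autovacuum: {last_autovacuum}")
conn.close()
```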
Notion's engineering team realized their current database architecture was no longer sustainable. They needed a solution to address the immediate issues and provide a foundation for future growth.
After careful consideration, the team decided to implement database sharding. As we discussed, sharding is a horizontal scaling technique that allows data distribution across multiple database instances.
Sharding Strategy and Design Decisions
When Notion's team chose to shard, they had to make important choices about how to do it.
The first big question was, what data should they shard? Notion's data model revolves around the concept of blocks. Each item on a Notion page - a paragraph, an image, or a database - is represented as a block. These blocks can contain other blocks, creating a hierarchical structure.
Given this setup, the Block table was the top candidate for sharding. But here was the tricky part: the Block table had connections to other tables. If they only sharded the Block table, they would have many complex cross-shard queries, which could slow things down.
So, they made a smart call and decided to shard not just the Block table but all the tables connected to it. This way, related data stayed together on the same shard, avoiding most of those expensive cross-shard queries.
The next question was: how to divide the data? Notion users usually work within one workspace, with each block belonging to a specific workspace. So, they chose to use the Workspace ID as their sharding key. This meant all the blocks and related data for a single workspace would end up on the same shard, minimizing the need to fetch data from multiple shards.
Finally, they had to decide on the number of shards. I loved this part since it involved a bit of math. Notion settled on 480 logical shards spread across 32 physical databases. Each physical database would host 15 logical shards.
Why 480? Because 480 is divisible by many numbers (including 32, 40, and 48), which gave them the flexibility to add or remove physical hosts while keeping the logical shards evenly distributed. For example, they could scale from 32 to 40 to 48 hosts in incremental jumps, with each host still holding a whole number of shards.
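To make the arithmetic concrete, here is a minimal Python sketch of how a workspace ID could be routed to one of the 480 logical shards and then to a physical host. The hashing scheme and the contiguous shard-to-host mapping are assumptions for illustration, not necessarily Notion's exact routing:

```python
# A sketch of the shard-routing arithmetic: 480 logical shards spread
# across 32 physical hosts, 15 logical shards per host.
import uuid

LOGICAL_SHARDS = 480
PHYSICAL_HOSTS = 32
SHARDS_PER_HOST = LOGICAL_SHARDS // PHYSICAL_HOSTS  # 15

def logical_shard(workspace_id: str) -> int:
    # Derive a stable integer from the workspace UUID, then bucket it.
    return uuid.UUID(workspace_id).int % LOGICAL_SHARDS

def physical_host(shard: int) -> int:
    # Contiguous ranges of logical shards live on the same host.
    return shard // SHARDS_PER_HOST

ws = "8c0b3c6a-2f1d-4b6e-9f3a-1d2e3f4a5b6c"  # made-up workspace ID
shard = logical_shard(ws)
print(f"workspace {ws} -> shard {shard} on host {physical_host(shard)}")

# Since 480 is divisible by 32, 40, and 48, rebalancing onto more hosts
# still leaves every host with a whole number of logical shards.
```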
Implementation and Migration Process
Now that Notion had its sharding strategy, it was time to move all that data without messing things up for its users. They thought of a smart way to make this go smoothly.
First up was the double-write phase. They set things up so that new data would be written to the old and new databases while they worked on moving the existing data.
But here's where it gets interesting. Instead of writing directly to both databases (which could lead to inconsistencies if something went wrong), they used an audit log.
Basically, they kept a record of all the writes happening in the old database and then had a separate process apply those changes to the new sharded databases. This way, the new databases could always catch up, even if something hiccupped during the process.
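Here is a minimal, self-contained Python sketch of that idea, with in-memory dicts standing in for the actual databases. The names and record shapes are illustrative assumptions, not Notion's code:

```python
# Double writes via an audit log: every write goes to the old store
# and to an append-only log; a catch-up step replays the log into the
# sharded stores and can resume from a cursor after failures.
LOGICAL_SHARDS = 480

old_db = {}                                   # stand-in for the monolith
shards = [{} for _ in range(LOGICAL_SHARDS)]  # stand-ins for shard DBs
audit_log = []                                # a durable table/queue in practice

def shard_for(workspace_id: int) -> int:
    return workspace_id % LOGICAL_SHARDS

def write_block(workspace_id: int, block_id: str, data: dict) -> None:
    old_db[block_id] = data                   # primary write path
    audit_log.append((workspace_id, block_id, data))

def catch_up(cursor: int) -> int:
    # Replay log entries from `cursor` onward; persisting the returned
    # cursor lets the process resume instead of starting over.
    for workspace_id, block_id, data in audit_log[cursor:]:
        shards[shard_for(workspace_id)][block_id] = data
        cursor += 1
    return cursor

write_block(42, "blk-1", {"type": "paragraph", "text": "hello"})
cursor = catch_up(0)
assert shards[42]["blk-1"] == old_db["blk-1"]
```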
To move all the old data over to the new sharded setup, Notion's team wrote a script that could copy over all the existing data, and they ran it on a beefy machine with 96 CPUs. Even with all that power, it still took about three days to move everything over. That gives you an idea of how much data we're talking about!
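The details of that script aren't public, but a backfill like this is typically parallelized by splitting the keyspace into batches and fanning them out across worker processes, which is presumably why the 96 CPUs mattered. A rough, hypothetical sketch:

```python
# A hypothetical parallel backfill: split the ID space into batches and
# copy each batch in a separate worker process.
from concurrent.futures import ProcessPoolExecutor

def copy_batch(batch: tuple[int, int]) -> int:
    lo, hi = batch
    # In reality: read rows in [lo, hi) from the old database and write
    # each one to the shard that owns its workspace.
    return hi - lo  # rows "copied" in this stand-in

batches = [(i, i + 10_000) for i in range(0, 1_000_000, 10_000)]

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=96) as pool:  # the 96-CPU box
        total = sum(pool.map(copy_batch, batches))
    print(f"copied {total} rows")
```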
But moving the data wasn't enough - they had to ensure it was correct. So, they came up with two ways to verify the data:
They wrote a verification script that compared records between the old and new databases. To save time, they randomly sampled UUIDs and compared only those records, since checking every record would have taken forever. Both checks are sketched in code below.
They also did something called "dark reads". For some requests, they'd fetch data from both the old and new databases, compare the results, and log any differences. They didn't serve the data from the new database for these requests (hence "dark"), but it let them check that everything matched without affecting users.
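Here is a minimal, self-contained version of both checks, with dicts standing in for the old and new databases. The sample size and record shapes are illustrative assumptions, not Notion's code:

```python
# Two verification techniques: offline sampling and dark reads.
import logging
import random

logging.basicConfig()
logger = logging.getLogger("verification")

old_db = {f"uuid-{i}": {"text": f"block {i}"} for i in range(100_000)}
new_db = dict(old_db)  # would be the sharded databases in reality

# 1. Offline verification: compare a random sample of records, since
#    checking every row would take too long.
sample = random.sample(sorted(old_db), k=1_000)
mismatches = [k for k in sample if old_db[k] != new_db[k]]
print(f"sampled {len(sample)} records, {len(mismatches)} mismatches")

# 2. Dark reads: serve the response from the old database, also fetch
#    from the new one, and log any divergence without affecting users.
def get_block(block_id: str):
    primary = old_db.get(block_id)   # what the user actually sees
    shadow = new_db.get(block_id)    # read "in the dark"
    if primary != shadow:
        logger.warning("mismatch for %s: %r vs %r", block_id, primary, shadow)
    return primary

print(get_block("uuid-42"))
```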
All of this work paid off. When it came time to switch over to the new sharded setup, they could do it with just a few minutes of downtime. For Notion's users, it mostly looked like a short blip of unavailability, but a massive change occurred behind the scenes.
Lessons Learned
Even though the sharding process worked, Notion's team learned some important lessons.
The first thing they realized was that they should have started earlier. Waiting until the existing database was struggling made the process more challenging. They had to be super careful with migrations to avoid putting even more stress on the system.
Another big takeaway was how important it is to aim for zero downtime. While the team managed to do the switch with just a few minutes of downtime, they realized that they could have done it without interruption if they had worked more on their catch-up script.
They also learned how critical it is to choose the right sharding key. Going with the Workspace ID was a great decision, as it kept related data together and minimized the need for complex cross-shard queries. However, they noticed that the application code would have been easier to write if they had combined the partition key and the primary key into a single column.
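To illustrate that last point, here's a small hypothetical sketch of the combined-key idea: embed the partition key in each record's ID, so any code holding an ID can route it to the right shard without an extra lookup. The ID format and hash choice are assumptions:

```python
# Embedding the partition key (workspace ID) in the record ID itself.
import zlib

NUM_SHARDS = 480

def make_block_id(workspace_id: str, block_uuid: str) -> str:
    # One column carries both the partition key and the unique part.
    return f"{workspace_id}:{block_uuid}"

def shard_from_id(block_id: str) -> int:
    workspace_id, _ = block_id.split(":", 1)
    # A deterministic hash keeps routing stable across processes.
    return zlib.crc32(workspace_id.encode()) % NUM_SHARDS

block_id = make_block_id("ws-8c0b3c6a", "blk-1d2e3f4a")
print(block_id, "-> shard", shard_from_id(block_id))
```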
Despite all the lessons learned, the outcome of the sharding process for Notion's users was that the product became noticeably faster. The new setup handled the load much better, which made things run more smoothly and sped up response times.
References
Herding elephants: Lessons learned from sharding Postgres at Notion
The Great Re-shard: adding Postgres capacity (again) with zero downtime