

Discover more from The Polymathic Engineer
Hi Friends,
Welcome to the 37th issue of the Polymathic Engineer newsletter. I apologize for skipping the last week. I went to my hometown to stay with my mother and sister and wanted to fully enjoy that moment without thinking of anything else.
This week will focus on a crucial topic for the backend of many distributed systems: how to generate unique identifiers. The outline will be as follows:
Introduction
UUID
ULID
Database generated IDs
Snowflake IDs
Generating Unique Identifiers
The backend of many large scale applications requires to generate unique IDs.
Identifiers like social security numbers, bank account numbers, tweet, image or file ids all require to be unique so that you can also use them as primary keys in the database tables.
When you’re working with a single MySQL database, you can simply use an auto-increment ID as the primary key or use a function getting the time of the day.
Using such sequential ids is like counting the pages of a book. Just as the pages of a book are named in order to help the reader know what comes next, these sequential ids give the data in a database a clear and consistent order.
However, sequential ids have some drawbacks. First, they can be predictable and expose sensitive information about your data or system. Second they fail in a distributed environment like a sharded MySQL database, since it would be hard to avoid conflicts between the generated IDs.
Here I want to explore some other formats and possibilities to create unique identifiers in distributed systems.
UUID
A widely used solution in distributed systems are Universal Unique Identifiers (UUIDs). UUIDs are 128-bit numbers that can be generated in a standard way at application level.
They are usually combined by several parts like time, node’s MAC address or MD5 hash. An example of UUID can be 550e8400-e29b-41d4-a716-446655440000.
The main benefits of UUIDs is that the space of the possible IDs is very large, virtually eliminating collisions. As an additional advantage, the uniqueness of the IDs doesn’t require any synchronization between the distributed nodes.
But UUIDs have also some drawbacks like:
size and efficiency: UUIDs are not very compact since they takes 128 bits, which can increase storage and transmission costs.
sorting: UUIDs cannot be sorted by time since they carry no information about when they were formed, making it difficult to organize them chronologically or compare them by age.
ULID
ULID (Universally Unique Lexicographically Sortable Identifier) is an identifier format that addresses some of the weakness of UUIDs: randomization and time-sortability.
ULIDs are made up of 26 characters that encode two components:
the first ten letters are the milliseconds since the Unix epoch (March 17th, 2023). This enables for simple sorting and comparison of ULIDs based on their creation time.
the remaining 16 characters constitute 80 bits of randomization, ensuring that the ULID is unique within the same millisecond. This decreases the possibility of sources or systems colliding.
An example of ULID can be 01FHZXHK8PTP9FVK99Z66GXQTX.
The main benefits of ULIDs are that they are more compact and efficient than UUIDs and are sortable by time. ULIDs, of course, have some limits and trade-offs.
First, ULID is not guaranteed to be globally unique because there is a slight probability of collision between two randomly produced strings during the same millisecond and because of the clock drift.
Second, ULID also reveals information about the identifiers' generation time and order, which may be undesirable for some applications that demand privacy or anonymity.
Database generated IDs
An alternate option is to generate IDs at the database level rather than the application level. Many databases include an auto increment function.
To generate unique IDs, a database server might be used. Flickr has used this strategy, which is also known as ticket service.
The key advantages of database generated IDs are that they simplify application code and are consecutive and short in length.
The biggest disadvantages are the additional round trip necessary to retrieve the IDs from the database and the database being a single point of failure.
The last issue can be mitigated by employing multiple database servers. A server, for example, could generate even IDs and odds IDs, while a round robin strategy could be used to balance the load across the servers and deal with downtime. The generated IDs clearly drift after a while.
Snowflake IDs
A last option is having a dedicated service generating snowflake Ids. This form of unique IDs was introduced by Twitter and then adopted also by Discord and Instagram.
The main idea is to generate the IDs as a composition of multiple fields. In the original implementation the fields were a 41-bit timestamp, a 10-bit worker ID, a 12-bit sequence number and a 1-bit reserved for future usage.
Some observations:
• the timestamp with ms resolution can work for 70 years after a starting epoch
• a coordination service (Zookeeper) assigns the worker ID considering both data center and server ID
• the sequence number support up to 4096 unique IDs per ms
Such way of generating IDs brings many advantages. First the system is high available, since can be implemented by 1024 machines, and scalable, since each machine can generate 4096 IDs per ms. Second the IDs are time sortable since the timestamp is in the higher order bits.
The main downside is that this approach involves the complexity of monitoring and maintaining a Zookeeper cluster, which, if a corporation does not already have, could be a significant strain.
References
Interesting Tweets
I have never met a user that cares about which technologies or programming languages you use. A great quote from Steve Jobs says: "You first start with the Customer Experience and then work your way backwards to develop the required technology". I like this mindset. Link
Writing code as quickly as possible and then refactor it into good code is actually faster than trying to write good code in the first place. Write small pieces of functionality at a time, testing them and refactoring is the right approach. Link
Code review comments are good as long as they:
1. clearly indicate if a remark is not blocking
2. give a reason why a change should be done
3. give possible suggestions for solving a problem
If a nit is clearly marked as nit, it shouldn't hurt. Link