Hi Friends,
Welcome to the 92nd issue of the Polymathic Engineer newsletter.
This week, we discuss the advantages and disadvantages of binary encodings, taking Google Protobuf as an example.
The outline is as follows:
Binary encodings vs Textual formats
What is Protobuf, and how it works
Backward and Forward compatibility
Being hands-on is the best way to build technical skills. CodeCrafters is an excellent platform that lets you build exciting stuff, such as your own version of Redis, an HTTP server, or even Git from scratch.
Understanding how such things work under the hood sets you apart. Sign up, and become a better software engineer.
Binary encodings vs. Textual formats
Data serialization is the process of converting structured data into a format that makes it easy to store, send, and later reconstruct. It plays a fundamental role in how applications communicate with each other and store information.
You can choose from two main types of serialization forms: textual formats and binary encodings. Each format has its own set of advantages and trade-offs.
Textual formats like JSON and XML became popular for many reasons, especially with web-based apps. They are simple, easy to use, and understandable by humans, which makes debugging and manual inspection straightforward.
But this ease of reading comes at a price. To begin with, textual formats are verbose and require more bytes than binary encodings to store the same amount of data.
Second, they can be hard to parse correctly, especially when they contain special characters or big numbers. For example, JSON doesn't distinguish integers from floating-point numbers and doesn't specify a precision, which can cause precision issues with large numbers.
Finally, neither JSON nor XML supports binary strings (sequences of bytes without a character encoding). You have to fall back on Base64 encoding or a similar workaround to circumvent this.
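To make the last point concrete, here is a minimal C# sketch using System.Text.Json (the Document type is made up for this example):

using System;
using System.Text.Json;

// Hypothetical record holding raw binary data.
record Document(string Name, byte[] Payload);

class Demo
{
    static void Main()
    {
        var doc = new Document("report", new byte[] { 0xDE, 0xAD, 0xBE, 0xEF });

        // System.Text.Json has no binary type: byte[] is Base64-encoded,
        // inflating the payload by roughly 33%.
        Console.WriteLine(JsonSerializer.Serialize(doc));
        // {"Name":"report","Payload":"3q2+7w=="}
    }
}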
Binary encodings were introduced to address these weaknesses. They provide a more compact representation of the data that requires fewer bytes, which means that less storage and bandwidth are needed.
They are quicker to parse and serialize, which speeds up applications. They support more data types with precise definitions and can include raw binary data without any extra encoding.
Because of these advantages, binary encodings are a better choice for use within an organization or for performance-critical applications.
What is Protobuf, and how it works
Protocol Buffers (Protobuf) is Google's solution for binary encoding. It is a language-neutral, platform-independent mechanism designed to serialize structured data in a way that's more compact and faster than traditional formats like JSON or XML.
Protobuf works through a combination of several key components. The process starts with defining your data in .proto files. These files use a simple, language-independent syntax to describe the schema of your data.
Here is an example:
syntax = "proto2";

message Person {
  required string name = 1;
  optional int32 age = 2;
  repeated string hobbies = 3;
}
The first two fields in the above schema are marked as required or optional. This does not affect how the fields are encoded, but the required keyword enables a runtime check that fails if the field is not set, which can help catch bugs. (The required keyword exists only in proto2; proto3 dropped it.) The third field is instead marked as repeated, which means it may occur multiple times and can be used to implement lists and arrays.
Once you've defined your data structure, the protoc compiler generates code in your chosen programming language. In the code snippet below, the --csharp_out option means that the generated code is C#. Similar options are provided for other supported languages.
protoc -I=$SRC_DIR --csharp_out=$DST_DIR $SRC_DIR/person.proto
The generated code provides an interface you can easily work with, including methods for serialization and deserialization.
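The exact output depends on the language and compiler version, but for the schema above the generated C# class has roughly this shape (an abridged sketch, not the literal generated file):

// Abridged sketch of the code protoc generates for the Person message.
// (The real class also implements Google.Protobuf.IMessage<Person>
// and contains several more members.)
public sealed partial class Person
{
    // Scalar fields become read-write properties.
    public string Name { get; set; }
    public int Age { get; set; }

    // Repeated fields become read-only collections you append to.
    public Google.Protobuf.Collections.RepeatedField<string> Hobbies { get; }

    // A static parser handles deserialization.
    public static Google.Protobuf.MessageParser<Person> Parser { get; }
}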
When serializing data, you use the generated methods to convert it into a compact binary format that gets stored or transmitted.
using System.IO;
using Google.Protobuf;

Person mario = new Person
{
    Name = "Super Mario",
    Age = 43,
    Hobbies = { "Eat mushrooms", "Hang out with Luigi" }
};

using (var output = File.Create("mario.dat"))
{
    mario.WriteTo(output); // generated method
}
When deserializing data, you use the generated methods to parse the binary format back into usable data structures.
Person mario;
using (var input = File.OpenRead("mario.dat"))
{
    mario = Person.Parser.ParseFrom(input); // generated method
}
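Once parsed, the message behaves like any regular C# object. A quick usage sketch (CalculateSize() is part of the message interface and returns the length of the binary encoding):

Console.WriteLine($"{mario.Name}, age {mario.Age}");
Console.WriteLine($"Hobbies: {string.Join(", ", mario.Hobbies)}");

// Handy for comparing the binary encoding against an equivalent JSON payload.
Console.WriteLine($"Serialized size: {mario.CalculateSize()} bytes");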
Backward and Forward Compatibility
One of Protobuf's key strengths is its ability to handle schema evolution while maintaining backward and forward compatibility.
This means that old code can read messages generated by new code, and new code can read messages generated by old code, even when the .proto files have changed.
This is achieved through careful use of the tag numbers associated with each field in the .proto files, combined with the optional keyword. Let's look at some examples to understand this better.
Adding new fields to your schema without breaking existing code is easy. Old code reads the new data by simply ignoring fields whose tag numbers it doesn't recognize; the wire type recorded with each tag tells the parser how many bytes to skip.
At the same time, your new code can still read the old data as long as the new fields are not marked as required (in that case, the runtime check would fail). The same reasoning applies to the old code when you remove fields from your schema.
In general, optional fields can be added or removed safely. An optional field can also be turned into a repeated one: new code reading old data sees a list with zero or one elements, while old code reading new data keeps only the last element of the list.
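As a concrete illustration, here is a hypothetical second version of the earlier schema (the email field and its tag number are made up for this example):

syntax = "proto2";

message Person {
  required string name = 1;
  optional int32 age = 2;
  repeated string hobbies = 3;

  // New in v2. Old code skips tag 4 as an unknown field,
  // while new code reading v1 data simply sees it as unset.
  optional string email = 4;
}

One caveat: if you later remove a field, never reuse its tag number, since old data encoded with that tag may still be around.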
Changing field types can be tricky, since some changes are safe and others are not. For example, changing int32 to int64 is safe for new code reading old data, because the parser fills the missing bits with zeros.
However, when old code reads data written by new code, the value gets truncated, since the old code still stores it in a 32-bit variable.
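Here is a sketch of that risky change, with a hypothetical view_count field:

message Article {
  // v1 declared this field as int32; v2 widens it to int64.
  // Both use the same varint wire format, so parsing never fails,
  // but old 32-bit code truncates values above 2^31 - 1.
  optional int64 view_count = 1;  // was: optional int32 view_count = 1;
}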
Food for thought
When you first implement a feature, any code you write can do the job. But as you need to maintain and extend your code over time, a good implementation will make your changes and maintenance easier. That's where coding starts to get hard.
Design patterns are convenient, but they can't blindly be applied to every situation. Use them only when they bring a clear benefit. Link
Interesting Read
Some interesting articles to read this week: