In the world of high-performance distributed systems, data serialization frameworks like Google Protocol Buffers (Protobuf) are indispensable. They offer compact binary formats and efficient parsing, making them ideal for everything from inter-service communication to persistent data storage. But when it comes to updating just a small part of an already serialized data blob, a common question arises: can we "patch" it directly, avoiding the overhead of reading, modifying, and re-writing the entire thing?
The short answer, for most practical purposes, is no. While Protobuf provides clever mechanisms that seem to offer direct patching, the reality is more nuanced. Let's dive into why the full "read-modify-write" cycle remains largely unavoidable and where the true efficiencies lie.
The Core Challenge: Binary Data's Unfixed Nature

Imagine a book where every word's length can change, and there are no fixed page numbers for individual words. If you change a single word, all subsequent words on that page (and potentially the entire book) would shift, requiring a complete re-layout. This is akin to the challenge of patching a binary serialized blob.
Protobuf, like Apache Thrift, uses a compact, variable-length binary encoding. Fields are identified by unique numeric tags, and their values are encoded efficiently, often with variable-length integers or length-prefixed strings. This design is fantastic for minimizing data size and maximizing parsing speed. However, it means that the exact byte offset and length of any given field are not fixed. Changing the value of a field, especially a string, can alter its byte length, which would then shift the positions of all subsequent fields in the binary stream. Attempting an "in-place" modification without recalculating and shifting all subsequent bytes would lead to data corruption.
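To make the offset problem concrete, here is a minimal sketch in pure Python (no protobuf library; the two-field `Person` schema is assumed for illustration) that hand-encodes two wire-format fields and shows how growing the first field's value shifts the second field's byte offset:

```python
def encode_varint(n: int) -> bytes:
    """Protobuf base-128 varint: 7 bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_string_field(field_no: int, value: str) -> bytes:
    """Length-delimited field: tag (wire type 2), then length, then bytes."""
    data = value.encode("utf-8")
    return encode_varint((field_no << 3) | 2) + encode_varint(len(data)) + data

def encode_int_field(field_no: int, value: int) -> bytes:
    """Varint field: tag (wire type 0), then the value as a varint."""
    return encode_varint((field_no << 3) | 0) + encode_varint(value)

# Assumed schema: message Person { string name = 1; int32 age = 2; }
blob_a = encode_string_field(1, "Alice") + encode_int_field(2, 30)
blob_b = encode_string_field(1, "Alicia") + encode_int_field(2, 30)

# The age field's tag byte (0x10) starts wherever the name field ends --
# there is no fixed offset, so a longer name shifts everything after it.
print(blob_a.index(b"\x10"), blob_b.index(b"\x10"))  # 7 8
```

A one-byte-longer name moved the `age` field from offset 7 to offset 8; in a realistic message every subsequent field would shift the same way, which is why in-place byte edits corrupt the stream.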
Misconception 1: The "Last Field Wins" Magic Trick

One intriguing feature of Protocol Buffers is its "last field wins" merge behavior for non-repeated fields. This means that if you concatenate the binary forms of two serialized Protobuf messages of the same type, then when the combined stream is deserialized, the value of a non-repeated field from the last occurrence in the stream will be used. For repeated fields, new values are appended, not overwritten.
How it seems to work (and why it's misleading for patching):
Let's say you have an original Person object serialized into a blob:
Original Blob: [name="Alice", age=30, phone_number=["111", "222"]]

You want to update only the name to "Alicia". You could create a new, small Protobuf message containing just the updated name:
Patch Blob: [name="Alicia"]

Then, you could concatenate this Patch Blob to the Original Blob:
Combined Blob: [name="Alice", age=30, phone_number=["111", "222"]] + [name="Alicia"]

When a Protobuf parser reads this Combined Blob, due to "last field wins," the name will indeed resolve to "Alicia," while age and phone_number will retain their original values.
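The concatenation trick can be demonstrated with hand-encoded wire-format bytes and a toy parser. This is a sketch covering only scalar fields (a real parser also handles repeated fields, nested messages, and the remaining wire types):

```python
def decode_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def parse_scalar_fields(buf: bytes) -> dict:
    """Toy parser for wire types 0 (varint) and 2 (length-delimited).
    A later occurrence of a field number overwrites an earlier one --
    exactly the 'last field wins' rule for non-repeated fields."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                       # varint value
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:                     # length-delimited value
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[field_no] = value                 # last occurrence wins
    return fields

# Original: name="Alice" (field 1), age=30 (field 2); patch: name="Alicia".
original = b"\x0a\x05Alice\x10\x1e"
patch = b"\x0a\x06Alicia"
merged = parse_scalar_fields(original + patch)
print(merged[1], merged[2])  # b'Alicia' 30
```

Note that the parser still walks every byte of the combined stream, including the stale "Alice": nothing was patched in place, and the blob grew by the size of the patch.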
The Catch: While this appears to be a patch, it's a deserialization rule, not a binary patching mechanism. The parser still has to read and process the entire concatenated stream to determine the final state of the message. You haven't avoided the deserialization cost; you've just changed how the parser resolves conflicts during deserialization.
Furthermore, this approach has severe limitations:

- It only works for non-repeated fields; repeated fields are appended to, so you cannot replace or remove their existing elements this way.
- You cannot clear a field back to its default value, since absent fields are simply skipped during merging.
- The blob grows with every "patch," because the stale values are never removed and must still be read on every parse.
- You still pay the full deserialization cost to materialize the final message.
Misconception 2: FieldMask Eliminates Re-serialization

Google's official Protobuf best practices recommend using FieldMask to support partial updates in APIs. This is an excellent pattern, but it's crucial to understand where its efficiency gains truly lie.
How FieldMask works:
A FieldMask is a separate Protobuf message that explicitly lists the paths of the fields a client intends to modify (e.g., name, address.street). When a client wants to update a resource, it sends a small request containing:

- The FieldMask listing the field paths to update.
- A partial resource object populated only with the new values for those fields.
Example of a network payload using FieldMask:
Instead of sending:
{ "name": "Alicia", "age": 30, "phone_number": ["111", "222"] } // (full object)

A client might send:
{ "update_mask": { "paths": ["name"] }, "person": { "name": "Alicia" } } // (much smaller payload)

Where FieldMask truly shines (and why re-serialization is still needed):
FieldMask significantly improves efficiency, but not by avoiding the deserialization/re-serialization cycle on the server's persistent data. Its benefits are primarily at the network communication and application logic layers:

- Network bandwidth: only the changed fields travel over the wire, not the whole resource.
- Explicit intent: the mask distinguishes "set this field to its default" from "leave this field untouched," which a bare partial object cannot express.
- Simpler application logic: the server applies exactly the listed paths, avoiding accidental overwrites of fields the client never meant to change.
However, once the server receives this partial update request, to apply it to the stored, serialized data, it still performs the following steps:

1. Read the stored blob from disk or the database.
2. Deserialize it into an in-memory message object.
3. Apply the masked fields from the request to that object.
4. Re-serialize the entire updated object.
5. Write the full new blob back to storage.
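Using JSON and plain dicts as a stand-in for generated Protobuf classes, the server-side cycle might look like the sketch below. `apply_mask` and the request shape are illustrative, not a real Protobuf API (the real Python library offers `FieldMask.MergeMessage` for this):

```python
import json

def apply_mask(stored: dict, patch: dict, paths: list) -> dict:
    """Copy only the masked (dot-separated) paths from patch into stored."""
    updated = json.loads(json.dumps(stored))     # cheap deep copy
    for path in paths:
        src, dst = patch, updated
        *parents, leaf = path.split(".")
        for key in parents:
            src = src[key]
            dst = dst.setdefault(key, {})
        dst[leaf] = src[leaf]
    return updated

# 1. Read the stored blob (a JSON-encoded stand-in for a Protobuf blob).
stored_blob = json.dumps({"name": "Alice", "age": 30,
                          "phone_number": ["111", "222"]}).encode()

# 2. Deserialize it into an in-memory object.
person = json.loads(stored_blob)

# 3. Apply only the fields named in the update mask.
request = {"update_mask": {"paths": ["name"]}, "person": {"name": "Alicia"}}
person = apply_mask(person, request["person"], request["update_mask"]["paths"])

# 4.-5. Re-serialize the ENTIRE object and write it back: the blob is
# rebuilt from scratch even though only one field changed.
new_blob = json.dumps(person).encode()
print(json.loads(new_blob)["name"])  # Alicia
```

The mask kept the request small and the merge logic simple, but steps 1, 2, 4, and 5 still touch the whole object: the savings are in transit and intent, not in storage I/O.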
For any robust and reliable modification of a Protocol Buffer serialized data blob, the read-modify-write cycle is the standard and necessary approach. This is because:

- The variable-length encoding means a changed value shifts the byte offsets of every field after it, so an in-place edit corrupts the stream.
- Deserialization is the only reliable way to locate a field's current bytes in the first place, since offsets are not fixed.
- Only a full re-serialization guarantees a well-formed blob that any conforming parser can read back.
Protocol Buffers are incredibly powerful for efficient data serialization and schema evolution. Features like "last field wins" and FieldMask are valuable tools, but their utility for "patching" existing serialized blobs is often misunderstood.
The "last field wins" behavior is a deserialization rule that can be leveraged for simple, non-repeated field updates via concatenation, but it still requires full deserialization and is not a general-purpose binary patching solution.
The FieldMask is an excellent API design pattern that optimizes network bandwidth and simplifies application logic for partial updates, but the server still performs a full read-modify-write cycle on the underlying data.
Ultimately, if you need to modify a Protobuf serialized blob, prepare for the full read-modify-write dance. The true efficiencies come from optimizing the communication of the patch (e.g., with FieldMask) and the in-memory processing, rather than magically altering bytes on disk.