How Protobuf Binary Wire Format Encodes Different Field Types: A Complete Technical Guide
Protocol Buffers encodes every message field as a tag-value pair where the tag combines the field number and wire type, with the specific encoding determined by a static mapping in wire_format_lite.cc that assigns varint, fixed32, fixed64, or length-delimited formats based on the proto type.
The protobuf binary wire format is the compact, language-neutral serialization standard that powers Protocol Buffers' cross-platform compatibility. Implemented in the protocolbuffers/protobuf C++ runtime, this encoding scheme transforms structured message data into a stream of bytes by assigning specific binary representations to each proto field type. Understanding this encoding mechanism is crucial for optimizing message size and debugging deserialization failures.
Tag Construction and Wire Type Fundamentals
Every encoded field begins with a tag that identifies the field number and specifies how to interpret the subsequent bytes. The tag is encoded as a varint where the lower 3 bits store the wire type and the upper bits store the field number.
In src/google/protobuf/wire_format_lite.h, the MakeTag function constructs this value:
inline constexpr uint32_t WireFormatLite::MakeTag(int field_number,
                                                  WireType type) {
  return GOOGLE_PROTOBUF_WIRE_FORMAT_MAKE_TAG(field_number, type);
}
The macro GOOGLE_PROTOBUF_WIRE_FORMAT_MAKE_TAG (defined around line 153) shifts the field number left by 3 bits and ORs the wire type. During serialization, WireFormatLite::WriteTag delegates to CodedOutputStream::WriteTag, implemented in src/google/protobuf/io/coded_stream.h around line 446.
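The shift-and-OR arithmetic can be checked by hand with a small standalone sketch. It does not use the protobuf library; the names make_tag, tag_field_number, and tag_wire_type are hypothetical stand-ins, and only the wire type numbers are taken from the published encoding rules.

```cpp
#include <cassert>
#include <cstdint>

// Wire type numbers as fixed by the protobuf encoding rules.
enum WireType : uint32_t {
  kVarint = 0,
  kFixed64 = 1,
  kLengthDelimited = 2,
  kStartGroup = 3,
  kEndGroup = 4,
  kFixed32 = 5,
};

// Hypothetical stand-in for MakeTag: field number in the upper bits,
// wire type in the low 3 bits.
inline uint32_t make_tag(uint32_t field_number, WireType type) {
  // Field numbers are limited to 29 bits so the tag fits in 32 bits.
  assert(field_number >= 1 && field_number <= 0x1FFFFFFF);
  return (field_number << 3) | type;
}

// Inverse operations: split a tag back into its two components.
inline uint32_t tag_field_number(uint32_t tag) { return tag >> 3; }
inline WireType tag_wire_type(uint32_t tag) {
  return static_cast<WireType>(tag & 0x7);
}
```

Under these definitions, field 2 with a length-delimited value yields the tag 0x12, and field 4 with a fixed64 value yields 0x21. Tags for field numbers up to 15 fit in a single varint byte.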
Field Type to Wire Type Mapping
The mapping from proto field types to wire types is defined by the static array kWireTypeForFieldType in src/google/protobuf/wire_format_lite.cc (lines 94-114). This table determines the encoding strategy for each primitive type:
| Proto field type | Wire type | Encoding used |
|---|---|---|
| double | WIRETYPE_FIXED64 | 64-bit little-endian |
| float | WIRETYPE_FIXED32 | 32-bit little-endian |
| int64 / uint64 | WIRETYPE_VARINT | Varint (negative int64 values take 10 bytes) |
| int32 / bool | WIRETYPE_VARINT | Varint (negative int32 values are sign-extended to 10 bytes; bool is 0 or 1) |
| sint32 / sint64 | WIRETYPE_VARINT | ZigZag-encoded, then varint |
| fixed64 / sfixed64 | WIRETYPE_FIXED64 | 64-bit little-endian |
| fixed32 / sfixed32 | WIRETYPE_FIXED32 | 32-bit little-endian |
| string / bytes | WIRETYPE_LENGTH_DELIMITED | Length-prefixed byte array |
| message | WIRETYPE_LENGTH_DELIMITED | Length-prefixed sub-message |
| enum | WIRETYPE_VARINT | Varint (same as int32) |
| group (deprecated) | WIRETYPE_START_GROUP / WIRETYPE_END_GROUP | Start-tag + embedded fields + end-tag |
Encoding Primitive Values
The CodedOutputStream class in src/google/protobuf/io/coded_stream.h implements the low-level byte manipulation for each encoding strategy.
Varint Encoding for Integer Types
Varints encode integers in base 128: each byte carries 7 payload bits, least-significant group first, and the high bit (0x80) signals that another byte follows. For unsigned types (uint32, uint64), the value is encoded directly. For the standard signed types (int32, int64), a negative value is sign-extended to 64 bits before encoding, so it always occupies 10 bytes.
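A minimal standalone sketch of the varint rule, written without the protobuf library, makes the byte counts easy to verify. The helper names append_varint and encode_int32 are hypothetical; the library's own implementation lives in CodedOutputStream and is more heavily optimized.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends `value` as a base-128 varint: 7 payload bits per byte,
// least-significant group first, high bit set on every byte but the last.
inline void append_varint(uint64_t value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// int32 values are sign-extended to 64 bits before encoding, which is
// why every negative value ends up 10 bytes long.
inline std::vector<uint8_t> encode_int32(int32_t value) {
  std::vector<uint8_t> out;
  append_varint(static_cast<uint64_t>(static_cast<int64_t>(value)), out);
  return out;
}
```

With this sketch, 300 encodes to the two bytes AC 02, while -1 encodes to nine bytes of FF followed by 01.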
ZigZag Encoding for Signed Integers
The sint32 and sint64 types use ZigZag encoding to minimize space for negative numbers. This mapping interleaves positive and negative values so that small magnitudes produce small varints regardless of sign.
In src/google/protobuf/wire_format_lite.h (lines 186-209), the transformation is implemented as:
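inline uint32_t WireFormatLite::ZigZagEncode32(int32_t n) {
  // The left shift is done in the unsigned domain to avoid shifting a
  // negative value; the arithmetic right shift replicates the sign bit.
  return (static_cast<uint32_t>(n) << 1) ^ static_cast<uint32_t>(n >> 31);
}
inline uint64_t WireFormatLite::ZigZagEncode64(int64_t n) {
  return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
}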
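To see the interleaving concretely, the following standalone sketch pairs the 32-bit encode step with its inverse. The function names are illustrative rather than the library's, and the code assumes the compiler performs an arithmetic right shift on negative values, which mainstream compilers do and C++20 guarantees.

```cpp
#include <cassert>
#include <cstdint>

// Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
inline uint32_t zigzag_encode32(int32_t n) {
  return (static_cast<uint32_t>(n) << 1) ^ static_cast<uint32_t>(n >> 31);
}

// Inverse mapping: the low bit selects the sign, the remaining bits
// hold the magnitude.
inline int32_t zigzag_decode32(uint32_t n) {
  return static_cast<int32_t>((n >> 1) ^ (~(n & 1) + 1));
}
```

Because -1 maps to 1 and -2 maps to 3, both fit in a single varint byte instead of the 10 bytes a plain int32 would need.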
Fixed-Size Encoding for Floating Point and Fixed Integers
Types float, double, fixed32, fixed64, sfixed32, and sfixed64 use little-endian byte order. CodedOutputStream::WriteLittleEndian32 and WriteLittleEndian64 ensure consistent encoding across platforms.
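As a rough illustration of what little-endian fixed-width output looks like, the sketch below serializes a float and a double byte by byte. It is independent of the protobuf library, uses hypothetical helper names, and assumes IEEE 754 floating-point representation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Emits the low-order byte first, regardless of host byte order.
inline void append_fixed32(uint32_t v, std::vector<uint8_t>& out) {
  for (int i = 0; i < 4; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

inline void append_fixed64(uint64_t v, std::vector<uint8_t>& out) {
  for (int i = 0; i < 8; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

// Floating-point values are written as their raw bit pattern.
inline std::vector<uint8_t> encode_float(float f) {
  static_assert(sizeof(float) == 4, "assumes 32-bit IEEE 754 float");
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  std::vector<uint8_t> out;
  append_fixed32(bits, out);
  return out;
}

inline std::vector<uint8_t> encode_double(double d) {
  static_assert(sizeof(double) == 8, "assumes 64-bit IEEE 754 double");
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  std::vector<uint8_t> out;
  append_fixed64(bits, out);
  return out;
}
```

Shifting and masking, rather than copying host memory directly, is what keeps the output identical on big-endian and little-endian machines.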
Length-Delimited Encoding for Strings, Bytes, and Messages
For string, bytes, and embedded message fields, the encoder first writes the payload length as a varint, followed by the raw bytes. This length-prefixing allows the parser to skip unknown fields efficiently by reading the length and advancing the cursor without interpreting the payload.
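The tag, length, payload layout can be sketched in a few lines of standalone C++. The helper names here are hypothetical and the code does not call into the protobuf library; it only reproduces the byte layout described above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

inline void append_varint(uint64_t value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Writes tag (wire type 2), then the payload length, then the raw bytes.
inline std::vector<uint8_t> encode_string_field(uint32_t field_number,
                                                const std::string& payload) {
  assert(field_number >= 1);
  std::vector<uint8_t> out;
  append_varint((static_cast<uint64_t>(field_number) << 3) | 2, out);
  append_varint(payload.size(), out);
  out.insert(out.end(), payload.begin(), payload.end());
  return out;
}
```

For field 2 holding "Alice", this produces 12 05 41 6C 69 63 65: one tag byte, one length byte, and five payload bytes. Payloads of 128 bytes or more need a multi-byte length varint.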
Practical Code Examples
Serializing a Message Manually with WireFormatLite
The following example demonstrates direct use of the encoding API to write raw protobuf bytes without generated classes:
// src/example/manual_serialisation.cc
#include <fstream>
#include "google/protobuf/io/coded_stream.h"
#include "google/protobuf/io/zero_copy_stream_impl.h"  // OstreamOutputStream
#include "google/protobuf/wire_format_lite.h"

// WireFormatLite lives in the internal namespace and is not a stable public API.
using google::protobuf::internal::WireFormatLite;

int main() {
  std::ofstream out("person.bin", std::ios::binary);
  google::protobuf::io::OstreamOutputStream raw_out(&out);
  google::protobuf::io::CodedOutputStream cos(&raw_out);

  // field 1: int32 id = 123;
  WireFormatLite::WriteInt32(1, 123, &cos);      // tag = (1 << 3) | VARINT

  // field 2: string name = "Alice";
  WireFormatLite::WriteString(2, "Alice", &cos); // tag = (2 << 3) | LENGTH_DELIMITED

  // field 3: bool is_employee = true;
  WireFormatLite::WriteBool(3, true, &cos);      // tag = (3 << 3) | VARINT

  // field 4: repeated double scores = {3.14, 2.71};
  // Written unpacked here: one tag per element.
  const double scores[] = {3.14, 2.71};
  for (double v : scores) {
    WireFormatLite::WriteDouble(4, v, &cos);     // tag = (4 << 3) | FIXED64
  }

  // cos and raw_out flush in their destructors, before `out` is closed.
  return 0;
}
The tag bytes are produced by WireFormatLite::WriteTag, which internally calls CodedOutputStream::WriteTag (see [coded_stream.h:446-462]).
Deserializing with CodedInputStream
Reading the binary data back requires parsing tags and dispatching to the appropriate read method:
// src/example/manual_deserialisation.cc
#include <cstdint>
#include <fstream>
#include <string>
#include "google/protobuf/io/coded_stream.h"
#include "google/protobuf/io/zero_copy_stream_impl.h"  // IstreamInputStream
#include "google/protobuf/wire_format_lite.h"

// WireFormatLite lives in the internal namespace and is not a stable public API.
using google::protobuf::internal::WireFormatLite;

int main() {
  std::ifstream in("person.bin", std::ios::binary);
  google::protobuf::io::IstreamInputStream raw_in(&in);
  google::protobuf::io::CodedInputStream cis(&raw_in);

  uint32_t tag;
  // ReadTag returns 0 at end of input or on a malformed tag.
  while ((tag = cis.ReadTag()) != 0) {
    bool ok = true;
    // A production parser would also check GetTagWireType(tag) before reading.
    switch (WireFormatLite::GetTagFieldNumber(tag)) {
      case 1: { int32_t id; ok = WireFormatLite::ReadPrimitive<int32_t, WireFormatLite::TYPE_INT32>(&cis, &id); /* … */ } break;
      case 2: { std::string name; ok = WireFormatLite::ReadString(&cis, &name); /* … */ } break;
      case 3: { bool emp; ok = WireFormatLite::ReadPrimitive<bool, WireFormatLite::TYPE_BOOL>(&cis, &emp); /* … */ } break;
      case 4: { double val; ok = WireFormatLite::ReadPrimitive<double, WireFormatLite::TYPE_DOUBLE>(&cis, &val); /* … */ } break;  // one unpacked element
      default:  // unknown field → skip
        ok = WireFormatLite::SkipField(&cis, tag);
        break;
    }
    if (!ok) return 1;  // truncated or malformed input
  }
  return 0;
}
The ReadTag method is defined in coded_stream.h around line 770 and uses the fast path for 1-byte tags ([coded_stream.h:777-785]).
Using Generated C++ Classes
In production code, developers typically rely on generated classes rather than manual encoding:
// src/example/person.proto
syntax = "proto3";
message Person {
  int32 id = 1;
  string name = 2;
  bool is_employee = 3;
  repeated double scores = 4;
}
// src/example/using_generated.cc
#include "person.pb.h"
#include <fstream>
int main() {
  Person p;
  p.set_id(123);
  p.set_name("Alice");
  p.set_is_employee(true);
  p.add_scores(3.14);
  p.add_scores(2.71);
  // Serialize to binary file
  std::ofstream out("person.bin", std::ios::binary);
  p.SerializeToOstream(&out);
}
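When SerializeToOstream executes, the generated serialization code for Person emits each field with the same tag and value encodings described above, so the output conforms to the wire format. The exact entry point depends on the protobuf release: recent versions generate an _InternalSerialize method that writes through the array-based WireFormatLite::Write*ToArray helpers, while older versions generated SerializeWithCachedSizes.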
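One detail is worth noting when comparing the generated output with the manual loop shown earlier: in proto3, repeated scalar numeric fields such as scores are packed by default, so the generated class emits a single length-delimited record rather than one tag per element. The standalone sketch below builds both layouts for comparison. It does not use the protobuf library, its helper names are hypothetical, and it assumes IEEE 754 doubles.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

using Bytes = std::vector<uint8_t>;

inline void append_varint(uint64_t value, Bytes& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Writes the raw 64-bit pattern, low-order byte first.
inline void append_double(double d, Bytes& out) {
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  for (int i = 0; i < 8; ++i) out.push_back(static_cast<uint8_t>(bits >> (8 * i)));
}

// Unpacked: one (tag, value) pair per element, wire type 1 (fixed64).
inline Bytes encode_unpacked(uint32_t field, const std::vector<double>& values) {
  assert(field >= 1);
  Bytes out;
  for (double v : values) {
    append_varint((static_cast<uint64_t>(field) << 3) | 1, out);
    append_double(v, out);
  }
  return out;
}

// Packed: a single tag with wire type 2, the total byte length, then
// the values back to back.
inline Bytes encode_packed(uint32_t field, const std::vector<double>& values) {
  assert(field >= 1);
  Bytes out;
  append_varint((static_cast<uint64_t>(field) << 3) | 2, out);
  append_varint(values.size() * 8, out);
  for (double v : values) append_double(v, out);
  return out;
}
```

For field 4, the unpacked form starts every element with the tag 0x21, while the packed form starts once with 0x22 followed by the byte length. The saving from packing grows with the element count, since each extra element costs 8 bytes instead of 9.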
Handling Unknown Fields
During parsing, the protobuf runtime can skip any field not defined in the current schema by interpreting the wire type embedded in the tag. The WireFormatLite::SkipField function in src/google/protobuf/wire_format_lite.cc (lines 16-56) implements this logic: for VARINT fields it reads until the continuation bit is clear; for FIXED32 it skips 4 bytes; for FIXED64 it skips 8 bytes; and for LENGTH_DELIMITED it reads the length varint and advances past the payload.
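The per-wire-type skipping rules can be sketched as a small standalone function operating on an in-memory buffer. This is an illustration under stated assumptions, not the library's SkipField: the names read_varint, advance, and skip_field are hypothetical, and group wire types are rejected rather than handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a varint starting at `pos` and advances `pos` past it.
// Returns false on truncated input or a varint longer than 10 bytes.
inline bool read_varint(const std::vector<uint8_t>& buf, size_t& pos,
                        uint64_t& value) {
  value = 0;
  for (int shift = 0; shift < 70; shift += 7) {
    if (pos >= buf.size()) return false;
    const uint8_t byte = buf[pos++];
    value |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return true;  // continuation bit clear
  }
  return false;
}

// Moves `pos` forward by `n` bytes if that many bytes remain.
inline bool advance(const std::vector<uint8_t>& buf, size_t& pos, uint64_t n) {
  if (n > buf.size() - pos) return false;
  pos += static_cast<size_t>(n);
  return true;
}

// Skips the value that follows `tag`; `pos` must point just past the tag.
inline bool skip_field(const std::vector<uint8_t>& buf, size_t& pos,
                       uint32_t tag) {
  assert(pos <= buf.size());
  uint64_t scratch = 0;
  switch (tag & 0x7) {
    case 0: return read_varint(buf, pos, scratch);  // VARINT
    case 1: return advance(buf, pos, 8);            // FIXED64
    case 2:                                         // LENGTH_DELIMITED
      return read_varint(buf, pos, scratch) && advance(buf, pos, scratch);
    case 5: return advance(buf, pos, 4);            // FIXED32
    default: return false;  // groups and invalid wire types: not handled here
  }
}
```

The key property is that the skip decision depends only on the low 3 bits of the tag, never on the schema, which is what lets an older reader step over fields added by a newer writer.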
Summary
- Protobuf binary wire format encodes messages as sequences of tag-value pairs, where each tag is a varint combining the field number and wire type.
- The mapping from proto types to wire types is defined by kWireTypeForFieldType in src/google/protobuf/wire_format_lite.cc, selecting between varint, fixed32, fixed64, and length-delimited strategies.
- Varint encoding efficiently represents unsigned integers, while ZigZag encoding (used for sint32/sint64) maps signed integers to unsigned values to optimize space for negative numbers.
- Fixed-size encoding uses little-endian byte order for float, double, and fixed-width integer types, ensuring platform-independent representation.
- Length-delimited encoding prefixes strings, bytes, and sub-messages with a varint length, enabling efficient parsing and skipping of unknown fields via WireFormatLite::SkipField.
Frequently Asked Questions
What is the protobuf binary wire format?
The protobuf binary wire format is the compact, binary serialization standard used by Protocol Buffers to encode structured data for transmission or storage. It represents each message as a sequence of tag-value pairs, where tags encode the field number and wire type as a varint, followed by the value encoded according to the specific wire type rules. This format is implemented in the C++ runtime of the protocolbuffers/protobuf repository and is language-independent, allowing cross-platform communication.
How does protobuf choose the wire type for different field types?
Protobuf selects the wire type by consulting the static lookup table kWireTypeForFieldType defined in src/google/protobuf/wire_format_lite.cc (lines 94-114). This array maps each FieldType enum value (such as TYPE_INT32, TYPE_STRING, or TYPE_DOUBLE) to a specific WireType enum value. For example, variable-width integer types, bool, and enum map to WIRETYPE_VARINT; float, fixed32, and sfixed32 map to WIRETYPE_FIXED32; double, fixed64, and sfixed64 map to WIRETYPE_FIXED64; and strings, bytes, and embedded messages map to WIRETYPE_LENGTH_DELIMITED.
Why does protobuf use ZigZag encoding for sint32 and sint64 types?
ZigZag encoding minimizes the varint size for signed integers by mapping signed values to unsigned values in a zig-zag pattern (0→0, -1→1, 1→2, -2→3, etc.). Without ZigZag, negative values for standard int32 or int64 would be sign-extended to 10 bytes in varint format. The sint32 and sint64 types use ZigZag encoding (implemented in WireFormatLite::ZigZagEncode32/64 in wire_format_lite.h) to ensure that small-magnitude negative numbers occupy minimal space, just like their positive counterparts.
How does protobuf handle unknown fields during deserialization?
When the parser encounters a field number not present in the current schema, it uses the wire type bits from the tag to skip the appropriate number of bytes without interpreting the payload. The WireFormatLite::SkipField function in src/google/protobuf/wire_format_lite.cc (lines 16-56) handles this: for WIRETYPE_VARINT it reads bytes until the continuation bit is clear; for WIRETYPE_FIXED32 it skips 4 bytes; for WIRETYPE_FIXED64 it skips 8 bytes; and for WIRETYPE_LENGTH_DELIMITED it reads the length varint and advances past the specified number of payload bytes. This allows backward and forward compatibility between different versions of protobuf schemas.