Zvec Serialization Formats for Data Import and Export: A Complete Technical Guide
Zvec utilizes five distinct serialization formats: a custom binary ".vecs" format for vector storage, JSON for index configuration, plain text for raw data ingestion, Apache Parquet for large-scale dataset conversion, and binary blobs for document metadata persistence.
The Alibaba zvec vector search engine relies on specialized serialization formats to handle different stages of the data pipeline. Understanding these serialization formats is essential for optimizing data ingestion, index configuration, and runtime performance in production deployments.
Custom Binary .vecs Format for Vector Storage
The primary storage format for vector data is a custom binary format with the .vecs extension. This format is defined in tools/core/vecs_common.h through the VecsHeader structure, which specifies the layout for dense vectors, keys, sparse data sections, and optional tag lists.
The conversion from human-readable text to binary .vecs is handled by the txt2vecs tool implemented in tools/core/txt2vecs.cc. This tool parses plain text input and writes the binary header followed by metadata blocks, dense vectors, and sparse data sections.
// Simplified excerpt from txt2vecs.cc
DEFINE_string(input, "input.txt", "txt input file");
DEFINE_string(output, "output.vecs", "vecs output file");
// …
bool ret = reader.load_record(FLAGS_input, ...);
// …
write_vecs_output(header, meta, keys, features, sparse_data, taglists);
At runtime, the engine loads .vecs files using VecsReader defined in tools/core/vecs_reader.h. This class memory-maps the binary file and provides random access to vectors via get_vector() and keys via get_key().
#include <cstddef>
#include <cstdint>
#include <iostream>
#include "zvec/tools/core/vecs_reader.h"

int main() {
  zvec::core::VecsReader reader;
  if (!reader.load("my_vectors.vecs")) {
    std::cerr << "Failed to load .vecs file\n";
    return 1;
  }
  // Iterate over every record by index; access is random, not sequential.
  size_t n = reader.num_vecs();
  for (size_t i = 0; i < n; ++i) {
    uint64_t id = reader.get_key(i);
    const void* vec = reader.get_vector(i);
    // The cast assumes the file stores float32 elements.
    const float* f = static_cast<const float*>(vec);
    // use the vector ...
  }
  return 0;
}
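To illustrate why a memory-mapped file with fixed-width records gives cheap random access, here is a minimal Python sketch. The layout it uses (a small header, then a key block, then a float32 vector block) is a hypothetical stand-in and not the actual VecsHeader layout, and the names write_toy_vecs and ToyVecsReader are invented for this example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy layout (NOT the real VecsHeader), all little-endian:
#   header : 4-byte magic, uint64 num_vecs, uint32 dim
#   keys   : num_vecs * uint64
#   vectors: num_vecs * dim * float32
HEADER = struct.Struct("<4sQI")
MAGIC = b"TOYV"


def write_toy_vecs(path, records):
    """Write (key, vector) pairs in the toy layout described above."""
    dim = len(records[0][1])
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, len(records), dim))
        for key, _ in records:
            f.write(struct.pack("<Q", key))
        for _, vec in records:
            f.write(struct.pack(f"<{dim}f", *vec))


class ToyVecsReader:
    """Memory-map the file and locate records by offset arithmetic."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        magic, self.num_vecs, self.dim = HEADER.unpack_from(self._map, 0)
        if magic != MAGIC:
            raise ValueError("not a toy vecs file")
        self._keys_off = HEADER.size
        self._vecs_off = self._keys_off + 8 * self.num_vecs

    def get_key(self, i):
        return struct.unpack_from("<Q", self._map, self._keys_off + 8 * i)[0]

    def get_vector(self, i):
        off = self._vecs_off + 4 * self.dim * i
        return struct.unpack_from(f"<{self.dim}f", self._map, off)

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    """Write two records to a temporary file and read the second one back."""
    path = os.path.join(tempfile.mkdtemp(), "toy.vecs")
    write_toy_vecs(path, [(10, [1.0, 2.0]), (20, [3.0, 4.0])])
    reader = ToyVecsReader(path)
    try:
        return reader.num_vecs, reader.get_key(1), reader.get_vector(1)
    finally:
        reader.close()
```

Because every key and vector has a fixed width, the reader computes the position of record i directly from the header fields and never scans the file. The operating system pages in only the regions that are actually touched.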
JSON for Index Configuration and Parameters
Index creation options, index parameters, and query parameters are expressed as JSON strings. The IndexParam class in src/include/zvec/core/interface/index_param.h provides the DeserializeFromJson method to parse these configurations into C++ structures.
#include <iostream>
#include <string>
#include "zvec/core/interface/index_param.h"

int main() {
  // Index configuration expressed as a JSON string.
  std::string json_cfg = R"({
    "dimension": 128,
    "metric_type": "L2",
    "index_type": "HNSW",
    "hnsw": { "ef_construction": 200 }
  })";
  zvec::core::IndexParam param;
  if (!param.DeserializeFromJson(json_cfg)) {
    std::cerr << "Invalid index JSON\n";
    return 1;
  }
  // pass `param` to IndexFactory to build the index …
  return 0;
}
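When configurations are generated by scripts, it can help to check them before handing them to the engine. The following Python sketch validates a JSON string against the keys used in the example above. Treating dimension, metric_type, and index_type as required is an assumption made for this sketch, and check_index_config is a hypothetical helper, not part of zvec.

```python
import json

# Keys taken from the JSON example in this article; treating all three as
# required is an assumption of this sketch, not a documented zvec rule.
REQUIRED_KEYS = ("dimension", "metric_type", "index_type")


def check_index_config(text):
    """Return (config, errors) for a JSON index configuration string."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(cfg, dict):
        return None, ["top-level value must be a JSON object"]

    errors = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in cfg]

    # bool is a subclass of int in Python, so reject it explicitly.
    if "dimension" in cfg:
        dim = cfg["dimension"]
        if isinstance(dim, bool) or not isinstance(dim, int) or dim <= 0:
            errors.append("dimension must be a positive integer")
    return cfg, errors
```

An empty error list means the configuration passed these basic checks; it does not guarantee that DeserializeFromJson will accept it.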
Plain Text Format for Raw Training Data
Before conversion to binary .vecs, raw vector data is often stored in plain text files with the .txt extension. Each line holds one record in the form <id>;<value1> <value2> …, where a semicolon separates the identifier from the space-separated vector values.
The txt2vecs.cc tool parses this format using load_record to read the text file and convert it into the binary representation.
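As a rough illustration of the record format, the Python sketch below parses lines of the form <id>;<value1> <value2> … and checks that every record has the same dimension. It is not a port of load_record: the function names are hypothetical, the identifier is kept as a string, and anything after a second semicolon is simply ignored.

```python
def parse_record(line):
    """Parse one '<id>;<value1> <value2> ...' line into (id, [floats])."""
    fields = line.strip().split(";")
    if len(fields) < 2 or not fields[0]:
        raise ValueError(f"malformed record: {line!r}")
    # Only the first field after the id is read; later fields are ignored.
    values = [float(tok) for tok in fields[1].split()]
    return fields[0], values


def load_records(lines):
    """Parse an iterable of lines, skipping blanks and checking dimensions."""
    records = []
    dim = None
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        rec_id, values = parse_record(line)
        if dim is None:
            dim = len(values)  # the first record fixes the dimension
        elif len(values) != dim:
            raise ValueError(
                f"line {lineno}: expected {dim} values, got {len(values)}"
            )
        records.append((rec_id, values))
    return records
```

Checking the dimension while parsing catches truncated or corrupted lines early, before they reach a fixed-width binary writer.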
Apache Parquet for Large-Scale Datasets
For large-scale training datasets, zvec provides a Python helper script tools/core/convert_cohere_parquet.py that reads Apache Parquet files. This script uses the polars library to load Parquet data and emits plain text files compatible with the txt2vecs tool.
# convert_cohere_parquet.py (simplified)
from pathlib import Path

import polars as pl


def gen_vector_files(input_dir, pattern, out_dir, out_name):
    for p in Path(input_dir).rglob(pattern):
        df = pl.read_parquet(p)  # <-- Parquet read
        # Append mode: every matching Parquet file adds to the same output.
        with open(Path(out_dir) / out_name, "a") as fout:
            # named=True yields dict rows, so columns can be read by name.
            for row in df.iter_rows(named=True):
                fid = row["id"]
                vec = row["emb"]
                line = f"{fid};" + " ".join(map(str, vec)) + ";\n"
                fout.write(line)
Binary Blob for Document Metadata
Document metadata is persisted using a binary blob format. The Doc class in src/include/zvec/db/doc.h provides serialize and deserialize methods that convert document objects to and from std::vector<uint8_t>.
This binary representation concatenates field IDs, types, and values into a flat byte buffer suitable for storage alongside the vector index.
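The idea of concatenating field IDs, types, and values can be sketched in Python as follows. The byte layout here (a uint32 field ID, a one-byte type tag, then a fixed-width or length-prefixed value) is purely an assumption for illustration and is not the encoding used by Doc::serialize; the function names and type tags are hypothetical.

```python
import struct

# Hypothetical type tags for this sketch; not zvec's actual encoding.
TYPE_INT64, TYPE_FLOAT64, TYPE_STRING = 1, 2, 3


def serialize_fields(fields):
    """Encode {field_id: value} as field ID, type tag, then value bytes."""
    buf = bytearray()
    for field_id, value in sorted(fields.items()):
        if isinstance(value, bool):
            raise TypeError("bool is not supported in this sketch")
        if isinstance(value, int):
            buf += struct.pack("<IBq", field_id, TYPE_INT64, value)
        elif isinstance(value, float):
            buf += struct.pack("<IBd", field_id, TYPE_FLOAT64, value)
        elif isinstance(value, str):
            raw = value.encode("utf-8")
            # Strings are length-prefixed because their width varies.
            buf += struct.pack("<IBI", field_id, TYPE_STRING, len(raw)) + raw
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    return bytes(buf)


def deserialize_fields(blob):
    """Decode a buffer produced by serialize_fields back into a dict."""
    fields = {}
    pos = 0
    while pos < len(blob):
        field_id, tag = struct.unpack_from("<IB", blob, pos)
        pos += 5
        if tag == TYPE_INT64:
            (value,) = struct.unpack_from("<q", blob, pos)
            pos += 8
        elif tag == TYPE_FLOAT64:
            (value,) = struct.unpack_from("<d", blob, pos)
            pos += 8
        elif tag == TYPE_STRING:
            (length,) = struct.unpack_from("<I", blob, pos)
            pos += 4
            value = blob[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unknown type tag: {tag}")
        fields[field_id] = value
    return fields
```

Storing the type tag next to each value lets the decoder walk the buffer without an external schema, at the cost of a few extra bytes per field.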
Summary
- Custom binary .vecs: Primary format for vector storage, defined by VecsHeader in tools/core/vecs_common.h and accessed via VecsReader.
- JSON: Used for index configuration and query parameters, parsed by IndexParam::DeserializeFromJson in src/include/zvec/core/interface/index_param.h.
- Plain text: Human-readable format <id>;<values> for raw data ingestion, processed by txt2vecs.cc.
- Apache Parquet: Supported via Python helper convert_cohere_parquet.py for large-scale dataset conversion.
- Binary blob: Document metadata serialization via Doc::serialize in src/include/zvec/db/doc.h.
Frequently Asked Questions
What is the primary vector storage format in zvec?
The primary vector storage format is a custom binary format with the .vecs extension. This format stores dense vectors, keys, sparse data, and optional tag lists in a single binary file defined by the VecsHeader structure in tools/core/vecs_common.h. The engine reads these files at runtime using the VecsReader class.
How does zvec handle large-scale dataset ingestion from Parquet files?
Zvec provides a Python conversion script tools/core/convert_cohere_parquet.py that uses the polars library to read Apache Parquet files. The script extracts vector embeddings and identifiers, then writes them to plain text files that can be processed by the txt2vecs tool to produce binary .vecs files for the engine.
Can index parameters be configured via JSON in zvec?
Yes, index creation and query parameters are configured using JSON strings. The IndexParam class in src/include/zvec/core/interface/index_param.h provides the DeserializeFromJson method to parse JSON configurations into C++ structures. This allows specification of dimension, metric type, index type, and algorithm-specific parameters such as HNSW's ef_construction.
What format does zvec use for persisting document metadata?
Document metadata is serialized as a binary blob using the Doc class in src/include/zvec/db/doc.h. The serialize method converts document fields into a std::vector<uint8_t> buffer, while deserialize reconstructs the document object from the binary data. This enables efficient storage of metadata alongside vector indices.