# Zvec Serialization Formats for Data Import and Export: A Complete Technical Guide

> Explore zvec serialization formats for data import and export, including `.vecs`, JSON, Parquet, and plain text. Master efficient data handling with this technical guide.

- Repository: [Alibaba/zvec](https://github.com/alibaba/zvec)
- Tags: how-to-guide
- Published: 2026-02-16

---

**Zvec utilizes five distinct serialization formats: a custom binary ".vecs" format for vector storage, JSON for index configuration, plain text for raw data ingestion, Apache Parquet for large-scale dataset conversion, and binary blobs for document metadata persistence.**

The Alibaba zvec vector search engine relies on specialized serialization formats to handle different stages of the data pipeline. Understanding these serialization formats is essential for optimizing data ingestion, index configuration, and runtime performance in production deployments.

## Custom Binary .vecs Format for Vector Storage

The primary storage format for vector data is a custom binary format with the `.vecs` extension. This format is defined in [`tools/core/vecs_common.h`](https://github.com/alibaba/zvec/blob/main/tools/core/vecs_common.h) through the `VecsHeader` structure, which specifies the layout for dense vectors, keys, sparse data sections, and optional tag lists.

The conversion from human-readable text to binary `.vecs` is handled by the `txt2vecs` tool implemented in `tools/core/txt2vecs.cc`. This tool parses plain text input and writes the binary header followed by metadata blocks, dense vectors, and sparse data sections.

```cpp
// Simplified excerpt from txt2vecs.cc
DEFINE_string(input, "input.txt", "txt input file");
DEFINE_string(output, "output.vecs", "vecs output file");
// …
bool ret = reader.load_record(FLAGS_input, ...);
// …
write_vecs_output(header, meta, keys, features, sparse_data, taglists);
```

At runtime, the engine loads `.vecs` files using `VecsReader` defined in [`tools/core/vecs_reader.h`](https://github.com/alibaba/zvec/blob/main/tools/core/vecs_reader.h). This class memory-maps the binary file and provides random access to vectors via `get_vector()` and keys via `get_key()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

#include "zvec/tools/core/vecs_reader.h"

int main() {
    zvec::core::VecsReader reader;
    if (!reader.load("my_vectors.vecs")) {
        std::cerr << "Failed to load .vecs file\n";
        return 1;
    }

    // Walk every record by index; the file stays memory-mapped while in use.
    size_t n = reader.num_vecs();
    for (size_t i = 0; i < n; ++i) {
        uint64_t id = reader.get_key(i);
        const void* vec = reader.get_vector(i);
        // The cast assumes the file stores float32 dense vectors.
        const float* f = static_cast<const float*>(vec);
        // use the vector ...
    }
    return 0;
}
```
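
To make the idea of memory-mapped random access concrete, here is a minimal Python sketch. It does not reproduce the real `VecsHeader` layout: the header fields, the section order, and the names `write_toy_vecs` and `ToyVecsReader` are all hypothetical, chosen only to show how fixed-size records allow `get_key(i)` and `get_vector(i)` style lookups by offset arithmetic.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy layout (NOT the real VecsHeader), little-endian:
#   header : uint64 num_vecs, uint32 dim   (12 bytes)
#   keys   : num_vecs * uint64
#   vectors: num_vecs * dim * float32
HEADER = struct.Struct("<QI")


def write_toy_vecs(path, keys, vectors):
    """Write the header, then all keys, then all dense vectors."""
    dim = len(vectors[0])
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(keys), dim))
        f.write(struct.pack(f"<{len(keys)}Q", *keys))
        for vec in vectors:
            f.write(struct.pack(f"<{dim}f", *vec))


class ToyVecsReader:
    """Memory-map the file and compute record offsets on demand."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self.num_vecs, self.dim = HEADER.unpack_from(self._map, 0)
        self._keys_off = HEADER.size
        self._vecs_off = self._keys_off + 8 * self.num_vecs

    def get_key(self, i):
        return struct.unpack_from("<Q", self._map, self._keys_off + 8 * i)[0]

    def get_vector(self, i):
        offset = self._vecs_off + 4 * self.dim * i
        return struct.unpack_from(f"<{self.dim}f", self._map, offset)

    def close(self):
        self._map.close()
        self._file.close()


# Round-trip two records and read the second one without scanning the first.
path = os.path.join(tempfile.mkdtemp(), "toy.vecs")
write_toy_vecs(path, [10, 20], [[1.0, 2.0], [3.0, 4.0]])
reader = ToyVecsReader(path)
second = (reader.get_key(1), reader.get_vector(1))
reader.close()
print(second)
```

Because every key and every vector has a fixed size, a lookup touches only the pages that hold the requested record, which is the property that makes memory mapping attractive for large vector files.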

## JSON for Index Configuration and Parameters

Index creation options, index parameters, and query parameters are expressed as JSON strings. The `IndexParam` class in [`src/include/zvec/core/interface/index_param.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/core/interface/index_param.h) provides the `DeserializeFromJson` method to parse these configurations into C++ structures.

```cpp
#include <iostream>
#include <string>

#include "zvec/core/interface/index_param.h"

int main() {
    // Index configuration expressed as a raw JSON string.
    std::string json_cfg = R"({
        "dimension": 128,
        "metric_type": "L2",
        "index_type": "HNSW",
        "hnsw": { "ef_construction": 200 }
    })";

    zvec::core::IndexParam param;
    if (!param.DeserializeFromJson(json_cfg)) {
        std::cerr << "Invalid index JSON\n";
        return 1;
    }
    // pass `param` to IndexFactory to build the index …
    return 0;
}
```
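
When configurations are generated by a script rather than written by hand, building the JSON string programmatically avoids quoting mistakes. The sketch below is a hypothetical helper, not part of zvec: `build_index_config` is an invented name, the key names are copied from the example above, and nesting algorithm options under the lower-cased index type is an assumption based on the `"hnsw"` entry in that example.

```python
import json


def build_index_config(dimension, metric_type="L2", index_type="HNSW", **algo_params):
    """Return a JSON string shaped like the configuration example above."""
    if isinstance(dimension, bool) or not isinstance(dimension, int) or dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    config = {
        "dimension": dimension,
        "metric_type": metric_type,
        "index_type": index_type,
    }
    if algo_params:
        # Assumption: algorithm options live under the lower-cased index type,
        # mirroring the "hnsw" block in the example.
        config[index_type.lower()] = dict(algo_params)
    return json.dumps(config)


json_cfg = build_index_config(128, ef_construction=200)
print(json_cfg)
```

The resulting string can then be handed to whatever code calls `DeserializeFromJson`; which keys the parser actually accepts should be checked against `index_param.h`.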

## Plain Text Format for Raw Training Data

Before conversion to binary `.vecs`, raw vector data is often stored in plain text files with the `.txt` extension. Each line follows the pattern `<id>;<value1> <value2> …`: a semicolon separates the identifier from the vector values, and the values themselves are separated by spaces.

The `txt2vecs.cc` tool parses this format using `load_record` to read the text file and convert it into the binary representation.
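
For checking an input file before running the converter, the line format can be parsed in a few lines of Python. This is an illustrative sketch, not the logic of `load_record`: the function names are hypothetical, integer identifiers are an assumption, and anything after a second semicolon is simply ignored (the Parquet helper in the next section emits a trailing `;`).

```python
from typing import Iterable, Iterator, List, Tuple


def parse_record(line: str) -> Tuple[int, List[float]]:
    """Parse one `<id>;<value1> <value2> ...` line into (id, values)."""
    fields = line.strip().split(";")
    if len(fields) < 2 or not fields[0]:
        raise ValueError(f"malformed record: {line!r}")
    key = int(fields[0])  # assumption: identifiers are integers
    values = [float(tok) for tok in fields[1].split()]
    return key, values


def parse_records(lines: Iterable[str]) -> Iterator[Tuple[int, List[float]]]:
    """Yield parsed records, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_record(line)


sample = ["1;0.1 0.2 0.3\n", "\n", "2;0.4 0.5 0.6;\n"]
records = list(parse_records(sample))
print(records)
```

A check like this catches missing separators or non-numeric values early, before a long conversion run.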

## Apache Parquet for Large-Scale Datasets

For large-scale training datasets, zvec provides a Python helper script [`tools/core/convert_cohere_parquet.py`](https://github.com/alibaba/zvec/blob/main/tools/core/convert_cohere_parquet.py) that reads Apache Parquet files. This script uses the **polars** library to load Parquet data and emits plain text files compatible with the `txt2vecs` tool.

```python
# convert_cohere_parquet.py
from pathlib import Path

import polars as pl


def gen_vector_files(input_dir, pattern, out_dir, out_name):
    for p in Path(input_dir).rglob(pattern):
        df = pl.read_parquet(p)  # load one Parquet file into a DataFrame

        # Append so rows from every matching file land in one output file.
        with open(Path(out_dir) / out_name, "a") as fout:
            # named=True yields dicts, so columns can be looked up by name.
            for row in df.iter_rows(named=True):
                fid = row["id"]
                vec = row["emb"]
                line = f"{fid};" + " ".join(map(str, vec)) + ";\n"
                fout.write(line)
```

## Binary Blob for Document Metadata

Document metadata is persisted using a binary blob format. The `Doc` class in [`src/include/zvec/db/doc.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/doc.h) provides `serialize` and `deserialize` methods that convert document objects to and from `std::vector<uint8_t>`.

This binary representation concatenates field IDs, types, and values into a flat byte buffer suitable for storage alongside the vector index.
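
The following Python sketch illustrates the general technique of packing field IDs, type tags, and values into one flat buffer. It is not the byte layout used by `Doc::serialize`: the type tags, field widths, and function names are hypothetical and chosen only to show how such a buffer can be written and read back.

```python
import struct

# Hypothetical type tags; the real Doc encoding may differ entirely.
TYPE_INT64, TYPE_FLOAT64, TYPE_STRING = 1, 2, 3


def serialize_fields(fields):
    """Pack {field_id: value} as repeated [uint32 id][uint8 type][value]."""
    buf = bytearray()
    for field_id, value in fields.items():
        if isinstance(value, bool):
            raise TypeError("bool is not supported in this sketch")
        if isinstance(value, int):
            buf += struct.pack("<IBq", field_id, TYPE_INT64, value)
        elif isinstance(value, float):
            buf += struct.pack("<IBd", field_id, TYPE_FLOAT64, value)
        elif isinstance(value, str):
            raw = value.encode("utf-8")
            # Strings are length-prefixed so the reader knows where they end.
            buf += struct.pack("<IBI", field_id, TYPE_STRING, len(raw)) + raw
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    return bytes(buf)


def deserialize_fields(blob):
    """Inverse of serialize_fields: walk the buffer field by field."""
    fields, off = {}, 0
    while off < len(blob):
        field_id, tag = struct.unpack_from("<IB", blob, off)
        off += 5
        if tag == TYPE_INT64:
            (value,) = struct.unpack_from("<q", blob, off)
            off += 8
        elif tag == TYPE_FLOAT64:
            (value,) = struct.unpack_from("<d", blob, off)
            off += 8
        elif tag == TYPE_STRING:
            (size,) = struct.unpack_from("<I", blob, off)
            off += 4
            value = blob[off:off + size].decode("utf-8")
            off += size
        else:
            raise ValueError(f"unknown type tag: {tag}")
        fields[field_id] = value
    return fields


doc = {1: 42, 2: 0.5, 3: "hello"}
blob = serialize_fields(doc)
restored = deserialize_fields(blob)
print(len(blob), restored == doc)
```

Tagging each value with its type keeps the buffer self-describing, so a reader can skip or decode fields without a separate schema lookup.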

## Summary

- **Custom binary .vecs**: Primary format for vector storage, defined by `VecsHeader` in [`tools/core/vecs_common.h`](https://github.com/alibaba/zvec/blob/main/tools/core/vecs_common.h) and accessed via `VecsReader`.
- **JSON**: Used for index configuration and query parameters, parsed by `IndexParam::DeserializeFromJson` in [`src/include/zvec/core/interface/index_param.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/core/interface/index_param.h).
- **Plain text**: Human-readable format `<id>;<values>` for raw data ingestion, processed by `txt2vecs.cc`.
- **Apache Parquet**: Supported via the Python helper [`tools/core/convert_cohere_parquet.py`](https://github.com/alibaba/zvec/blob/main/tools/core/convert_cohere_parquet.py) for large-scale dataset conversion.
- **Binary blob**: Document metadata serialization via `Doc::serialize` in [`src/include/zvec/db/doc.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/doc.h).

## Frequently Asked Questions

### What is the primary vector storage format in zvec?

The primary vector storage format is a custom binary format with the `.vecs` extension. This format stores dense vectors, keys, sparse data, and optional tag lists in a single binary file defined by the `VecsHeader` structure in [`tools/core/vecs_common.h`](https://github.com/alibaba/zvec/blob/main/tools/core/vecs_common.h). The engine reads these files at runtime using the `VecsReader` class.

### How does zvec handle large-scale dataset ingestion from Parquet files?

Zvec provides a Python conversion script [`tools/core/convert_cohere_parquet.py`](https://github.com/alibaba/zvec/blob/main/tools/core/convert_cohere_parquet.py) that uses the polars library to read Apache Parquet files. The script extracts vector embeddings and identifiers, then writes them to plain text files that can be processed by the `txt2vecs` tool to produce binary `.vecs` files for the engine.

### Can index parameters be configured via JSON in zvec?

Yes, index creation and query parameters are configured using JSON strings. The `IndexParam` class in [`src/include/zvec/core/interface/index_param.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/core/interface/index_param.h) provides the `DeserializeFromJson` method to parse JSON configurations into C++ structures. This allows specification of dimension, metric type, index type, and algorithm-specific parameters such as HNSW's `ef_construction`.

### What format does zvec use for persisting document metadata?

Document metadata is serialized as a binary blob using the `Doc` class in [`src/include/zvec/db/doc.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/doc.h). The `serialize` method converts document fields into a `std::vector<uint8_t>` buffer, while `deserialize` reconstructs the document object from the binary data. This enables efficient storage of metadata alongside vector indices.