# Batching Vector Inserts in zvec: Best Practices for High-Throughput Ingestion

> Maximize zvec vector insert throughput. Use Collection.insert or upsert with lists of Doc objects, staying under the 1024 document batch limit. Avoid lock contention with efficient batching.

- Repository: [Alibaba/zvec](https://github.com/alibaba/zvec)
- Tags: best-practices
- Published: 2026-02-16

---

**Batch vector inserts in zvec should use the `Collection.insert` or `Collection.upsert` APIs with lists of `Doc` objects, staying within the hard limit of 1024 documents per batch to maximize throughput and avoid lock contention.**

Batching vector inserts efficiently is critical when building high-performance AI applications with the `alibaba/zvec` vector database. The library provides optimized paths for bulk ingestion, but exceeding its internal limits causes whole batches to be rejected, and single-document loops can severely degrade throughput. This guide covers the hard limits, validation rules, and implementation patterns derived directly from the zvec source code.

## Understanding the Batch Insert API in zvec

The zvec Python API explicitly distinguishes between single-document and batch operations. In [`python/zvec/model/collection.py`](https://github.com/alibaba/zvec/blob/main/python/zvec/model/collection.py) (lines 33–50), the `insert` method detects whether you pass a `list[Doc]` or a single `Doc`. When a list is provided, it forwards the entire batch to the C++ layer in one operation, acquiring the write lock once for the entire set.

Passing single documents in a Python loop forces the C++ `write_impl` function to acquire its `std::lock_guard` once per document instead of once per batch. Each call also crosses the Python-to-C++ boundary again, so the fixed per-call cost is paid for every document rather than being amortized over the batch.
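The difference in lock traffic can be illustrated with a toy model. The sketch below does not use zvec at all: `CountingLock`, `write`, `run_single`, and `run_batched` are hypothetical names, and the only assumption carried over from the text is that each write call takes the lock exactly once.

```python
import threading


class CountingLock:
    """A threading.Lock wrapper that counts how often it is acquired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def write(lock, store, docs):
    """Stand-in for a write path that takes the lock once per call."""
    with lock:
        store.extend(docs)


def run_single(n):
    """Insert n items one call at a time; return the lock acquisition count."""
    lock, store = CountingLock(), []
    for i in range(n):
        write(lock, store, [i])
    return lock.acquisitions


def run_batched(n, batch_size=1024):
    """Insert n items in chunks of batch_size; return the lock acquisition count."""
    lock, store = CountingLock(), []
    for start in range(0, n, batch_size):
        write(lock, store, list(range(start, min(start + batch_size, n))))
    return lock.acquisitions


if __name__ == "__main__":
    print(run_single(10_000))   # 10000 acquisitions
    print(run_batched(10_000))  # 10 acquisitions
```

In this model, 10,000 documents cost 10,000 lock acquisitions when written one at a time and 10 when written in batches of 1024. Real timings depend on the engine and are not measured here.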

## Hard Limits and Validation Constraints

### Respect the kMaxWriteBatchSize Limit (1024 Documents)

zvec enforces a hard upper bound on write batch sizes to prevent memory pressure and excessive lock hold times. In [`src/db/common/constants.h`](https://github.com/alibaba/zvec/blob/main/src/db/common/constants.h) at line 62, `kMaxWriteBatchSize` is defined as **1024** documents.

The `write_impl` function in `src/db/collection.cc` (lines 43–45) explicitly checks this limit:

```cpp
if (docs.size() > kMaxWriteBatchSize) {
    return Status::InvalidArgument("Too many docs");
}
```

Exceeding this threshold aborts the entire operation with an `InvalidArgument` error, so client applications must chunk large ingestions into ≤1024 document batches.
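Client-side chunking can be sketched with a small generator. This is a minimal, library-independent example: `chunked` and `MAX_WRITE_BATCH_SIZE` are names introduced here for illustration, with the constant mirroring the `kMaxWriteBatchSize` value quoted above.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

MAX_WRITE_BATCH_SIZE = 1024  # mirrors kMaxWriteBatchSize as quoted in this guide


def chunked(items: Iterable[T], size: int = MAX_WRITE_BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if not 1 <= size <= MAX_WRITE_BATCH_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_WRITE_BATCH_SIZE}")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


if __name__ == "__main__":
    print([len(b) for b in chunked(range(2500))])  # [1024, 1024, 452]
```

Each yielded list can then be passed to a single `insert` or `upsert` call. Because the helper is a generator, it also works for inputs that are streamed rather than held in memory.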

### Schema Validation Per Document

Every document in a batch undergoes schema validation before any data is written. In `src/db/collection.cc` (lines 33–36), `write_impl` iterates through the document vector calling `doc.validate()`. If any document fails validation, the entire batch fails.

Ensure that every `Doc` object conforms to the collection's schema (matching field types, non-nullable constraints, and vector dimensions) before including it in the batch.
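A client-side pre-check can catch the most common problems before a batch is submitted. The sketch below is an assumption-laden illustration, not zvec's own validation: `precheck` is a hypothetical helper, and records are plain dictionaries with `fields` and `vectors` keys standing in for `Doc` objects.

```python
from typing import Any, Dict, List, Tuple


def precheck(
    records: List[Dict[str, Any]],
    required_fields: Dict[str, type],
    vector_dims: Dict[str, int],
) -> List[Tuple[int, str]]:
    """Return (index, reason) pairs for records that look invalid.

    `required_fields` maps non-nullable field names to expected Python types;
    `vector_dims` maps vector names to the expected dimension.
    """
    problems: List[Tuple[int, str]] = []
    for idx, rec in enumerate(records):
        fields = rec.get("fields", {})
        vectors = rec.get("vectors", {})
        for name, expected in required_fields.items():
            value = fields.get(name)
            if value is None:
                problems.append((idx, f"field '{name}' is missing or null"))
            elif not isinstance(value, expected):
                problems.append((idx, f"field '{name}' is not {expected.__name__}"))
        for name, dim in vector_dims.items():
            vec = vectors.get(name)
            if vec is None or len(vec) != dim:
                problems.append((idx, f"vector '{name}' must have {dim} dimensions"))
    return problems


if __name__ == "__main__":
    records = [
        {"fields": {"id": 1, "name": "ok"}, "vectors": {"dense": [0.0] * 128}},
        {"fields": {"id": 2, "name": "short"}, "vectors": {"dense": [0.0] * 64}},
        {"fields": {"id": 3}, "vectors": {"dense": [0.0] * 128}},
    ]
    for idx, reason in precheck(records, {"id": int, "name": str}, {"dense": 128}):
        print(idx, reason)
```

Filtering or repairing the flagged records first keeps one bad document from failing the other documents in its batch. The engine's own `doc.validate()` remains the authority on what is accepted.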

## Performance Optimization Strategies

### Avoid Single-Document Loops

As noted in the source comments at `src/db/collection.cc` (line 38), the current write lock is coarse-grained. Single-document loops not only incur repeated lock acquisition costs but also block concurrent writers for the duration of each individual insert.

**Best practice:** Accumulate documents in a Python list (or C++ `std::vector`) until you reach your desired batch size (up to 1024), then submit the batch in a single call.
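The accumulate-then-submit pattern can be wrapped in a small buffer. This is a sketch under stated assumptions: `BatchBuffer` is a hypothetical helper, and `sink` is any callable that accepts a list, such as a collection's `insert` method.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class BatchBuffer(Generic[T]):
    """Collect items and hand them to `sink` in lists of at most `batch_size`."""

    def __init__(self, sink: Callable[[List[T]], object], batch_size: int = 1024):
        if not 1 <= batch_size <= 1024:
            raise ValueError("batch_size must be between 1 and 1024")
        self._sink = sink
        self._batch_size = batch_size
        self._pending: List[T] = []

    def add(self, item: T) -> None:
        """Queue one item; flush automatically when the buffer is full."""
        self._pending.append(item)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any queued items to the sink as one batch."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._sink(batch)

    def __enter__(self) -> "BatchBuffer[T]":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Flush the remainder only when the block finished without an error.
        if exc_type is None:
            self.flush()
        return False


if __name__ == "__main__":
    sizes = []
    with BatchBuffer(lambda batch: sizes.append(len(batch)), batch_size=4) as buf:
        for i in range(10):
            buf.add(i)
    print(sizes)  # [4, 4, 2]
```

With a real collection, the sink would be something like `col.insert`, and a production version would also inspect the statuses the sink returns instead of discarding them.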

### Tune Segment Size for Large Batches

zvec manages storage in segments. When a segment reaches `max_doc_count_per_segment`, the engine switches to a new segment. The check occurs in `src/db/collection.cc` at lines 76–78 within `need_switch_to_new_segment`.

For very large ingestion jobs (millions of vectors), tuning the segment size can reduce the frequency of segment switches. For routine batch inserts under 1024 documents, however, the cost of segment switching is typically negligible.
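A back-of-the-envelope estimate shows how segment switches compare with write calls for a large job. The helper below is hypothetical and simplified: it assumes ingestion starts with an empty segment and that a switch happens each time a segment fills. The segment sizes used are illustrative values, not zvec defaults.

```python
import math
from typing import Dict


def estimate_ingestion(
    total_docs: int,
    max_doc_count_per_segment: int,
    batch_size: int = 1024,
) -> Dict[str, int]:
    """Estimate write calls and segment switches for an ingestion job."""
    if total_docs < 0 or max_doc_count_per_segment < 1:
        raise ValueError("counts must be non-negative and segment size positive")
    if not 1 <= batch_size <= 1024:
        raise ValueError("batch_size must be between 1 and 1024")
    write_calls = math.ceil(total_docs / batch_size)
    segments = math.ceil(total_docs / max_doc_count_per_segment)
    return {
        "write_calls": write_calls,
        "segment_switches": max(0, segments - 1),
    }


if __name__ == "__main__":
    # Illustrative numbers only: 5 million documents, 1 million per segment.
    print(estimate_ingestion(5_000_000, 1_000_000))
    # {'write_calls': 4883, 'segment_switches': 4}
```

Under these assumptions, switches are rare relative to write calls, which is why segment tuning mostly matters for very large jobs or small segment sizes.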

### Choose Insert Over Upsert for New Data

The `write_impl` function in `src/db/collection.cc` dispatches to different handlers based on the `WriteMode` (lines 56–66). `Insert` mode calls `handle_insert`, while `Upsert` calls `handle_upsert`, which performs an additional existence check.

When you know your primary keys are new, use `Collection.insert()` to avoid the lookup overhead. Use `Collection.upsert()` only when you require "insert-or-update" semantics.
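When an application tracks which primary keys it has already written, it can route documents to the cheaper path. The sketch below assumes such a client-side set exists; `split_new_and_existing` and its parameters are hypothetical and do not query zvec.

```python
from typing import Callable, Hashable, Iterable, List, Tuple, TypeVar

D = TypeVar("D")


def split_new_and_existing(
    docs: Iterable[D],
    known_ids: Iterable[Hashable],
    key: Callable[[D], Hashable],
) -> Tuple[List[D], List[D]]:
    """Split docs into (new, existing) using a locally tracked set of IDs.

    A repeated ID inside `docs` counts as existing from its second occurrence.
    """
    seen = set(known_ids)
    new_docs: List[D] = []
    existing_docs: List[D] = []
    for doc in docs:
        doc_id = key(doc)
        if doc_id in seen:
            existing_docs.append(doc)
        else:
            seen.add(doc_id)
            new_docs.append(doc)
    return new_docs, existing_docs


if __name__ == "__main__":
    docs = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "c"}]
    new_docs, existing_docs = split_new_and_existing(docs, {"b"}, lambda d: d["id"])
    print(new_docs)       # [{'id': 'a'}, {'id': 'c'}]
    print(existing_docs)  # [{'id': 'b'}, {'id': 'a'}]
```

The first list would go to `insert` and the second to `upsert`, with the insert submitted first so repeated IDs resolve in input order. If the local set cannot be trusted, sending everything through `upsert` is the safer choice.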

## Code Examples

### Python: Bulk Inserting 10,000 Vectors

This example chunks 10,000 documents into batches of at most 1024, respecting `kMaxWriteBatchSize`:

```python
import zvec
from zvec import Collection, CollectionOption, DataType, Doc, FieldSchema, VectorSchema, HnswIndexParam

# Create collection once
schema = zvec.CollectionSchema(
    name="my_collection",
    fields=[
        FieldSchema("id", DataType.INT64, nullable=False),
        FieldSchema("name", DataType.STRING, nullable=False)
    ],
    vectors=[
        VectorSchema("dense", DataType.VECTOR_FP32, dimension=128,
                     index_param=HnswIndexParam())
    ],
)
option = CollectionOption(read_only=False, enable_mmap=True)
col = zvec.create_and_open(path="/tmp/my_collection", schema=schema, option=option)

# Build batches respecting kMaxWriteBatchSize (1024)
def build_batch(start, batch_size):
    return [
        Doc(
            id=str(i),
            fields={"id": i, "name": f"item_{i}"},
            vectors={"dense": [float(i % 100) * 0.01] * 128},
        )
        for i in range(start, start + batch_size)
    ]

batch_size = 1024  # Hard limit from constants.h
total = 10_000

for offset in range(0, total, batch_size):
    # Clamp the final batch so exactly `total` documents are inserted
    batch = build_batch(offset, min(batch_size, total - offset))
    statuses = col.insert(batch)  # One call for the entire batch
    assert all(s.ok() for s in statuses)
```

*Implementation reference:* The `insert` method in [`python/zvec/model/collection.py`](https://github.com/alibaba/zvec/blob/main/python/zvec/model/collection.py) (lines 33–50) detects the list type and forwards to the C++ batch path, while `src/db/collection.cc` (lines 43–45) enforces the size limit.
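In production code, an `assert` is a blunt way to react to failed statuses. The sketch below shows one way to locate failures instead. It assumes only what the example above already relies on: each returned status exposes an `ok()` method. `failed_positions` is a hypothetical helper and `FakeStatus` is a stand-in for demonstration, not a zvec type.

```python
from typing import Iterable, List


def failed_positions(statuses: Iterable) -> List[int]:
    """Return the positions within a batch whose status reports failure."""
    return [pos for pos, status in enumerate(statuses) if not status.ok()]


class FakeStatus:
    """Stand-in used only for this demo; not a zvec type."""

    def __init__(self, is_ok: bool):
        self._is_ok = is_ok

    def ok(self) -> bool:
        return self._is_ok


if __name__ == "__main__":
    statuses = [FakeStatus(True), FakeStatus(False), FakeStatus(True), FakeStatus(False)]
    offset = 2048  # position of this batch within the overall dataset
    for pos in failed_positions(statuses):
        print(f"document {offset + pos} failed")
```

Adding the batch offset to each position maps a failure back to the original dataset, so the affected documents can be logged or retried.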

### C++: Low-Level Batch Insert

For applications using the native C++ API, explicitly reserve capacity and pass a vector of `Doc` objects:

```cpp
#include <cassert>
#include <string>
#include <vector>

#include <zvec/db/collection.h>
#include <zvec/db/doc.h>

using namespace zvec;

int main() {
    auto coll = Collection::Open("/tmp/my_collection", CollectionOptions());
    std::vector<Doc> docs;
    docs.reserve(1024);  // Match kMaxWriteBatchSize

    for (int i = 0; i < 1024; ++i) {
        Doc d;
        d.set_pk(std::to_string(i));
        d.set_any("id", FieldSchema::INT64, i);
        d.set_any("name", FieldSchema::STRING,
                  std::string("item_") + std::to_string(i));
        std::vector<float> dense(128, static_cast<float>(i % 100) * 0.01f);
        d.set_any("dense", VectorSchema::VECTOR_FP32, dense);
        docs.emplace_back(std::move(d));
    }

    // Single lock acquisition for entire batch
    auto result = coll->Insert(docs);
    for (const auto& status : result) {
        assert(status.ok());
    }
}
```

*Implementation reference:* `CollectionImpl::Insert` forwards to `write_impl` (`src/db/collection.cc` line 1370), which acquires the `std::lock_guard` once for the entire vector.

## Summary

- **Respect the 1024-document limit** defined in `kMaxWriteBatchSize` (`src/db/common/constants.h#L62`) to avoid `InvalidArgument` errors.
- **Always batch documents** into `list[Doc]` or `std::vector<Doc>` rather than inserting in loops to minimize lock contention in `write_impl`.
- **Validate schema compliance** before batch submission; a single invalid document fails the entire batch during the `doc.validate` phase.
- **Prefer `Insert` over `Upsert`** for new data to eliminate existence-check overhead, and monitor segment switching for very large ingestion jobs.

## Frequently Asked Questions

### What is the maximum batch size for vector inserts in zvec?

The hard limit is **1024 documents** per batch, defined by `kMaxWriteBatchSize` in [`src/db/common/constants.h`](https://github.com/alibaba/zvec/blob/main/src/db/common/constants.h) (line 62). The `write_impl` function in `src/db/collection.cc` (lines 43–45) explicitly checks this limit and returns an `InvalidArgument` status if exceeded.

### Should I use insert or upsert when batching vectors in zvec?

Use **`Collection.insert()`** when you know the primary keys are new, as it avoids the extra lookup overhead that `upsert` performs. Use **`Collection.upsert()`** only when you require "insert-or-update" semantics and are unsure whether the IDs already exist. The dispatch logic in `src/db/collection.cc` (lines 56–66) shows that `upsert` calls `handle_upsert`, which includes an existence check not present in `handle_insert`.

### Why is my batch insert slow even with small batches?

If you are inserting documents in a Python loop rather than accumulating them into a list, you force the C++ layer to acquire the write lock for every single document. The comment in `src/db/collection.cc` (line 38) notes that the lock is coarse-grained, so repeated acquisitions create significant contention. Accumulate documents into a `list[Doc]` and submit them as one batch to minimize lock overhead.

### How does zvec handle schema validation during batch inserts?

zvec validates **every document** in the batch against the collection schema before acquiring the write lock. In `src/db/collection.cc` (lines 33–36), `write_impl` iterates through the input vector calling `doc.validate()`. If any document fails validation, the entire batch operation returns an error status immediately, and no data is written. Ensure all documents conform to the schema's field types, nullability constraints, and vector dimensions before batching.