Batching Vector Inserts in zvec: Best Practices for High-Throughput Ingestion
Batch vector inserts in zvec should use the Collection.insert or Collection.upsert APIs with lists of Doc objects. Passing a list lets the engine take its write lock once per batch instead of once per document, and keeping each batch within the hard limit of 1024 documents avoids InvalidArgument errors.
Batching vector inserts efficiently is critical when building high-performance AI applications with the alibaba/zvec vector database. The library provides optimized paths for bulk ingestion, but violating internal constraints or using single-document loops can severely degrade performance. This guide covers the hard limits, validation rules, and implementation patterns derived directly from the zvec source code.
Understanding the Batch Insert API in zvec
The zvec Python API explicitly distinguishes between single-document and batch operations. In python/zvec/model/collection.py (lines 33–50), the insert method detects whether you pass a list[Doc] or a single Doc. When a list is provided, it forwards the entire batch to the C++ layer in one operation, acquiring the write lock once for the entire set.
Passing single documents in a Python loop forces the C++ write_impl function to acquire the std::lock_guard for every iteration. This creates significant lock overhead and serializes what could be parallelized work.
Hard Limits and Validation Constraints
Respect the kMaxWriteBatchSize Limit (1024 Documents)
zvec enforces a hard upper bound on write batch sizes to prevent memory pressure and excessive lock hold times. In src/db/common/constants.h at line 62, kMaxWriteBatchSize is defined as 1024 documents.
The write_impl function in src/db/collection.cc (lines 43–45) explicitly checks this limit:
if (docs.size() > kMaxWriteBatchSize) {
  return Status::InvalidArgument("Too many docs");
}
Exceeding this threshold aborts the entire operation with an InvalidArgument error, so client applications must chunk large ingestions into ≤1024 document batches.
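A minimal sketch of client-side chunking, assuming the 1024-document limit quoted above. The names chunked and MAX_WRITE_BATCH_SIZE are hypothetical helpers written for this illustration and are not part of zvec; the helper works on any iterable, so it can be tested without the library installed.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Assumption: mirrors the kMaxWriteBatchSize value quoted in this article.
MAX_WRITE_BATCH_SIZE = 1024


def chunked(items: Iterable[T], size: int = MAX_WRITE_BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from any iterable."""
    if not 1 <= size <= MAX_WRITE_BATCH_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_WRITE_BATCH_SIZE}")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 2500 items split into two full batches and one partial batch.
sizes = [len(batch) for batch in chunked(range(2500))]
print(sizes)  # [1024, 1024, 452]
```

Each yielded list could then be passed to a single insert call, so no call exceeds the limit regardless of the total dataset size.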
Schema Validation Per Document
Every document in a batch undergoes schema validation before any data is written. In src/db/collection.cc (lines 33–36), write_impl iterates through the document vector calling doc.validate(). If any document fails validation, the entire batch fails.
Ensure that every Doc object conforms to the collection's schema—matching field types, non-nullable constraints, and vector dimensions—before including it in the batch.
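Because one bad document fails the whole batch, it can help to screen records on the client before building Doc objects. The sketch below is an assumption-heavy illustration: it checks plain dictionaries against a hypothetical schema description (REQUIRED_FIELDS, VECTOR_NAME, VECTOR_DIM) and does not call or reproduce zvec's own doc.validate logic.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema description for illustration; these are not zvec types.
REQUIRED_FIELDS = {"id": int, "name": str}
VECTOR_NAME = "dense"
VECTOR_DIM = 128


def find_invalid(records: List[Dict[str, Any]]) -> List[Tuple[int, str]]:
    """Return (index, reason) pairs for records that break the assumed schema."""
    problems: List[Tuple[int, str]] = []
    for index, record in enumerate(records):
        fields = record.get("fields", {})
        for name, expected_type in REQUIRED_FIELDS.items():
            value = fields.get(name)
            if value is None:
                problems.append((index, f"missing non-nullable field '{name}'"))
            elif not isinstance(value, expected_type):
                problems.append((index, f"field '{name}' is not {expected_type.__name__}"))
        vector = record.get("vectors", {}).get(VECTOR_NAME)
        if vector is None or len(vector) != VECTOR_DIM:
            problems.append((index, f"vector '{VECTOR_NAME}' must have {VECTOR_DIM} dimensions"))
    return problems


records = [
    {"fields": {"id": 1, "name": "ok"}, "vectors": {"dense": [0.0] * 128}},
    {"fields": {"id": 2, "name": None}, "vectors": {"dense": [0.0] * 64}},
]
problems = find_invalid(records)
print(problems)
```

Records reported here can be repaired or dropped before batching, so a single malformed entry does not cost a full 1024-document round trip.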
Performance Optimization Strategies
Avoid Single-Document Loops
As noted in the source comments at src/db/collection.cc (line 38), the current write lock is coarse-grained. Single-document loops not only incur repeated lock acquisition costs but also block concurrent writers for the duration of each individual insert.
Best practice: Accumulate documents in a Python list (or C++ std::vector) until you reach your desired batch size (up to 1024), then submit the batch in a single call.
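The accumulate-then-submit pattern can be sketched as a small buffer that flushes whenever it fills up. BatchBuffer is a hypothetical helper, not a zvec class; the sink argument is any callable that accepts a list, and the example uses a stand-in that only records batch sizes.

```python
from typing import Any, Callable, List


class BatchBuffer:
    """Accumulate items and hand them to `sink` in lists of at most `batch_size`."""

    def __init__(self, sink: Callable[[List[Any]], Any], batch_size: int = 1024) -> None:
        if not 1 <= batch_size <= 1024:
            raise ValueError("batch_size must be between 1 and 1024")
        self._sink = sink
        self._batch_size = batch_size
        self._pending: List[Any] = []

    def add(self, item: Any) -> None:
        """Queue one item and flush automatically once the buffer is full."""
        self._pending.append(item)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Submit whatever is pending as one batch; a no-op when empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._sink(batch)


# Stand-in sink that records batch sizes; real code might pass col.insert instead.
calls: List[int] = []
buffer = BatchBuffer(sink=lambda batch: calls.append(len(batch)), batch_size=1024)
for i in range(3000):
    buffer.add(i)
buffer.flush()  # submit the final partial batch
print(calls)  # [1024, 1024, 952]
```

The explicit final flush matters: without it, the last partial batch would stay in memory and never reach the collection.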
Tune Segment Size for Large Batches
zvec manages storage in segments. When a segment reaches max_doc_count_per_segment, the engine switches to a new segment. The check occurs in src/db/collection.cc at lines 76–78 within need_switch_to_new_segment.
For very large ingestion jobs (millions of vectors), tuning the segment size can reduce the frequency of segment switches. For routine batch inserts of up to 1024 documents, however, the cost of segment switching is typically negligible.
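To get a feel for how segment size affects switch frequency, the arithmetic can be sketched as below. This is a rough estimate under stated assumptions: ingestion starts in an empty segment, every segment fills completely before the next one is opened, and the document counts used are made-up illustration values rather than zvec defaults.

```python
def estimated_segment_switches(total_docs: int, max_doc_count_per_segment: int) -> int:
    """Rough number of segment switches when ingestion starts in an empty segment."""
    if total_docs < 0 or max_doc_count_per_segment <= 0:
        raise ValueError("total_docs must be >= 0 and the segment size must be > 0")
    if total_docs == 0:
        return 0
    # Ceiling division: how many segments are needed to hold every document.
    segments_needed = -(-total_docs // max_doc_count_per_segment)
    # The first segment already exists, so it does not count as a switch.
    return segments_needed - 1


print(estimated_segment_switches(10_000_000, 1_000_000))  # 9
print(estimated_segment_switches(10_000_000, 5_000_000))  # 1
print(estimated_segment_switches(1024, 1_000_000))        # 0
```

Under these assumptions, a single 1024-document batch rarely triggers a switch at all, while a ten-million-vector load sees the switch count drop sharply as the segment size grows.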
Choose Insert Over Upsert for New Data
The write_impl function dispatches to different handlers based on the WriteMode (lines 56–66). Insert mode calls handle_insert, while Upsert calls handle_upsert, which performs an additional existence check.
When you know your primary keys are new, use Collection.insert() to avoid the lookup overhead. Use Collection.upsert() only when you require "insert-or-update" semantics.
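One way to apply this rule, assuming the application already tracks which primary keys it has written, is to split incoming keys into a "new" group and an "existing" group. The helper split_by_known_keys below is hypothetical and works on plain keys only; it does not query zvec.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def split_by_known_keys(
    keys: Iterable[Hashable], known: Set[Hashable]
) -> Tuple[List[Hashable], List[Hashable]]:
    """Partition keys into (new, existing) while preserving input order."""
    new_keys: List[Hashable] = []
    existing_keys: List[Hashable] = []
    for key in keys:
        if key in known:
            existing_keys.append(key)
        else:
            new_keys.append(key)
    return new_keys, existing_keys


# Assumption: the application keeps its own record of keys written so far.
already_written = {"1", "3"}
new_keys, existing_keys = split_by_known_keys(["1", "2", "3", "4"], already_written)
print(new_keys)       # ['2', '4']
print(existing_keys)  # ['1', '3']
```

Documents for the new keys could then go through insert, and only the remainder through upsert. If the application cannot reliably know which keys exist, sending everything through upsert is the safer choice.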
Code Examples
Python: Bulk Inserting 10,000 Vectors
This example demonstrates chunking a large dataset into batches of 1024 documents, respecting kMaxWriteBatchSize:
import zvec
from zvec import Collection, CollectionOption, DataType, Doc, FieldSchema, VectorSchema, HnswIndexParam

# Create collection once
schema = zvec.CollectionSchema(
    name="my_collection",
    fields=[
        FieldSchema("id", DataType.INT64, nullable=False),
        FieldSchema("name", DataType.STRING, nullable=False),
    ],
    vectors=[
        VectorSchema("dense", DataType.VECTOR_FP32, dimension=128,
                     index_param=HnswIndexParam())
    ],
)
option = CollectionOption(read_only=False, enable_mmap=True)
col = zvec.create_and_open(path="/tmp/my_collection", schema=schema, option=option)

# Build batches respecting kMaxWriteBatchSize (1024)
def build_batch(start, batch_size):
    return [
        Doc(
            id=str(i),
            fields={"id": i, "name": f"item_{i}"},
            vectors={"dense": [float(i % 100) * 0.01] * 128},
        )
        for i in range(start, start + batch_size)
    ]

batch_size = 1024  # Hard limit from constants.h
total = 10_000
for offset in range(0, total, batch_size):
    # Shrink the final batch so exactly `total` documents are inserted
    batch = build_batch(offset, min(batch_size, total - offset))
    statuses = col.insert(batch)  # Single round-trip for entire batch
    assert all(s.ok() for s in statuses)
Implementation reference: The insert method in python/zvec/model/collection.py (lines 33–50) detects the list type and forwards to the C++ batch path, while src/db/collection.cc (lines 43–45) enforces the size limit.
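In production code, a bare assert on the returned statuses hides which documents failed. A minimal sketch of collecting failed positions follows; it relies only on the ok() method used in the example above, and FakeStatus is a stand-in written for this illustration, not a zvec class.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class FakeStatus:
    """Stand-in for a per-document status object; only ok() is modeled."""

    success: bool

    def ok(self) -> bool:
        return self.success


def failed_indices(statuses: Sequence[Any], offset: int = 0) -> List[int]:
    """Return dataset-wide positions of the documents whose status is not ok."""
    return [offset + i for i, status in enumerate(statuses) if not status.ok()]


# Pretend this batch started at dataset offset 2048 and two documents failed.
statuses = [FakeStatus(True), FakeStatus(False), FakeStatus(True), FakeStatus(False)]
failures = failed_indices(statuses, offset=2048)
print(failures)  # [2049, 2051]
```

Passing the batch offset keeps the reported positions meaningful across the whole ingestion job, which makes retries or logging easier than stopping at the first failed assertion.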
C++: Low-Level Batch Insert
For applications using the native C++ API, explicitly reserve capacity and pass a vector of Doc objects:
#include <cassert>
#include <string>
#include <utility>
#include <vector>

#include <zvec/db/collection.h>
#include <zvec/db/doc.h>

using namespace zvec;

int main() {
  auto coll = Collection::Open("/tmp/my_collection", CollectionOptions());

  std::vector<Doc> docs;
  docs.reserve(1024);  // Match kMaxWriteBatchSize

  for (int i = 0; i < 1024; ++i) {
    Doc d;
    d.set_pk(std::to_string(i));
    d.set_any("id", FieldSchema::INT64, i);
    d.set_any("name", FieldSchema::STRING,
              std::string("item_") + std::to_string(i));
    // 128-dimensional vector with every component set to the same value
    std::vector<float> dense(128, static_cast<float>(i % 100) * 0.01f);
    d.set_any("dense", VectorSchema::VECTOR_FP32, dense);
    docs.emplace_back(std::move(d));
  }

  // Single lock acquisition for entire batch
  auto result = coll->Insert(docs);
  for (const auto& status : result) {
    assert(status.ok());
  }
}
Implementation reference: CollectionImpl::Insert forwards to write_impl (src/db/collection.cc line 1370), which acquires the std::lock_guard once for the entire vector.
Summary
- Respect the 1024-document limit defined in kMaxWriteBatchSize (src/db/common/constants.h#L62) to avoid InvalidArgument errors.
- Always batch documents into list[Doc] or std::vector<Doc> rather than inserting in loops to minimize lock contention in write_impl.
- Validate schema compliance before batch submission; a single invalid document fails the entire batch during the doc.validate phase.
- Prefer Insert over Upsert for new data to eliminate existence-check overhead, and monitor segment switching for very large ingestion jobs.
Frequently Asked Questions
What is the maximum batch size for vector inserts in zvec?
The hard limit is 1024 documents per batch, defined by kMaxWriteBatchSize in src/db/common/constants.h (line 62). The write_impl function in src/db/collection.cc (lines 43–45) explicitly checks this limit and returns an InvalidArgument status if exceeded.
Should I use insert or upsert when batching vectors in zvec?
Use Collection.insert() when you know the primary keys are new, as it avoids the extra lookup overhead that upsert performs. Use Collection.upsert() only when you require "insert-or-update" semantics and are unsure whether the IDs already exist. The dispatch logic in src/db/collection.cc (lines 56–66) shows that upsert calls handle_upsert, which includes an existence check not present in handle_insert.
Why is my batch insert slow even with small batches?
If you are inserting documents in a Python loop rather than accumulating them into a list, you force the C++ layer to acquire the write lock for every single document. The comment in src/db/collection.cc (line 38) notes that the lock is coarse-grained, so repeated acquisitions create significant contention. Accumulate documents into a list[Doc] and submit them as one batch to minimize lock overhead.
How does zvec handle schema validation during batch inserts?
zvec validates every document in the batch against the collection schema before acquiring the write lock. In src/db/collection.cc (lines 33–36), write_impl iterates through the input vector calling doc.validate(). If any document fails validation, the entire batch operation returns an error status immediately, and no data is written. Ensure all documents conform to the schema's field types, nullability constraints, and vector dimensions before batching.