ZVec Collection Schema and Index Parameter Configurations: A Complete Guide
ZVec collection schemas define scalar fields via FieldSchema and vector fields via VectorSchema, while index parameters such as InvertIndexParam, HnswIndexParam, and IVFIndexParam control how each field is indexed for optimized search performance.
ZVec, Alibaba’s high-performance vector database, organizes data into collections governed by structured schemas. Understanding how collection schemas and index parameters are configured is essential for defining data types, enabling efficient similarity search, and optimizing query performance across both scalar and vector fields.
Understanding ZVec Collection Schema Structure
The CollectionSchema class serves as the top-level container that defines the structure of a ZVec collection. According to the source code in python/zvec/model/schema/collection_schema.py, this class holds the collection name, a list of scalar FieldSchema objects, and a list of VectorSchema objects. It automatically validates the uniqueness of field names across both scalar and vector fields to prevent naming collisions.
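As a rough illustration of the uniqueness rule described above, the following sketch checks names across scalar and vector fields in a single pass. `SketchField` and `SketchCollectionSchema` are hypothetical stand-ins written for this example; they are not ZVec classes, and the real validation in collection_schema.py may differ in detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchField:
    """Hypothetical stand-in for a scalar or vector field; only the name matters here."""
    name: str


@dataclass
class SketchCollectionSchema:
    """Hypothetical stand-in for a collection schema, not ZVec's CollectionSchema."""
    name: str
    fields: List[SketchField] = field(default_factory=list)
    vectors: List[SketchField] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Scalar and vector fields share one namespace, so check them together.
        seen = set()
        for f in self.fields + self.vectors:
            if f.name in seen:
                raise ValueError(f"duplicate field name: {f.name!r}")
            seen.add(f.name)
```

Under this sketch, a scalar field and a vector field that share a name fail at construction time rather than later at query time.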
FieldSchema for Scalar Fields
Scalar fields are defined using the FieldSchema class, located in python/zvec/model/schema/field_schema.py. Each scalar field description includes:
- Name: The column identifier
- DataType: The scalar data type (e.g., INT64, STRING)
- Nullable: Boolean indicating whether null values are permitted
- Index Parameter: Optional InvertIndexParam for enabling inverted index optimizations
The InvertIndexParam supports range queries and wildcard search on scalar columns, with options like enable_range_optimization and enable_extended_wildcard.
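To show why an inverted index suits range filters on scalar columns, here is a toy version in plain Python. It is a conceptual sketch only: `SketchInvertedIndex` is a hypothetical name, and nothing here reflects how ZVec implements enable_range_optimization internally.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class SketchInvertedIndex:
    """Toy inverted index: maps each scalar value to the row ids that hold it."""

    def __init__(self):
        self._postings = defaultdict(list)  # value -> list of row ids
        self._sorted_values = []            # distinct values, kept sorted for range scans

    def add(self, row_id, value):
        if value not in self._postings:
            # Insert the new distinct value at its sorted position.
            self._sorted_values.insert(bisect_left(self._sorted_values, value), value)
        self._postings[value].append(row_id)

    def range(self, low, high):
        """Return row ids whose value lies in the closed interval [low, high]."""
        start = bisect_left(self._sorted_values, low)
        stop = bisect_right(self._sorted_values, high)
        rows = []
        for value in self._sorted_values[start:stop]:
            rows.extend(self._postings[value])
        return sorted(rows)
```

Because the distinct values are kept sorted, a range filter needs two binary searches and a scan over the matching values only, instead of a pass over every row.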
VectorSchema for Vector Fields
Vector fields use the VectorSchema class (also in python/zvec/model/schema/field_schema.py) to define high-dimensional embeddings. Each vector schema specifies:
- Name: The vector column identifier
- DataType: Vector data type (e.g., VECTOR_FP32)
- Dimension: The vector dimensionality (e.g., 128, 768)
- Index Parameter: Required vector index configuration (FlatIndexParam, HnswIndexParam, or IVFIndexParam)
Building a Collection Schema in Python
Here is a complete example demonstrating how to construct a collection schema with both scalar and vector fields:
from zvec import CollectionSchema, FieldSchema, VectorSchema
from zvec.typing import DataType, MetricType
from zvec.model.param import InvertIndexParam, HnswIndexParam

# Define scalar field with inverted index for range queries
doc_id_field = FieldSchema(
    name="doc_id",
    data_type=DataType.INT64,
    nullable=False,
    index_param=InvertIndexParam(enable_range_optimization=True),
)

# Define vector field with HNSW index for approximate search
embedding_field = VectorSchema(
    name="embedding",
    data_type=DataType.VECTOR_FP32,
    dimension=128,
    index_param=HnswIndexParam(m=16, ef_construction=200, metric_type=MetricType.COSINE),
)

# Assemble the collection schema
schema = CollectionSchema(
    name="document_collection",
    fields=[doc_id_field],
    vectors=[embedding_field],
)
The constructor automatically validates that field names are unique across both fields and vectors, and that index parameters match their respective field types.
Index Parameter Configurations in ZVec
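ZVec defines two families of index parameters: one scalar index type (InvertIndexParam) and three vector index types (FlatIndexParam, HnswIndexParam, and IVFIndexParam). They are implemented in C++ in src/include/zvec/db/index_params.h and exposed to Python through zvec.model.param. These configurations determine how data is indexed and queried.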
Scalar Index Parameters
For scalar fields, ZVec provides the InvertIndexParam class:
- Purpose: Optimizes range queries and wildcard searches on non-vector columns
- Key Parameters:
  - enable_range_optimization: Boolean to enable optimized range filtering
  - enable_extended_wildcard: Boolean to support advanced wildcard patterns
Vector Index Parameters
ZVec supports three vector index strategies, each with distinct performance characteristics:
FlatIndexParam
- Purpose: Brute-force exact search with 100% recall
- Best for: Small datasets or when exact results are mandatory
- Parameters:
  - metric_type (IP, L2, COSINE)
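What brute-force search means for the three metric types can be sketched in plain Python. The function and dictionary names below are hypothetical and chosen for this example; the sketch makes no claim about how ZVec's flat index is implemented, only about what an exhaustive scan computes.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0


# Metric name -> (scoring function, whether a higher score means a closer match).
METRICS = {
    "L2": (l2_distance, False),
    "IP": (inner_product, True),
    "COSINE": (cosine_similarity, True),
}


def flat_search(query, vectors, k, metric="L2"):
    """Score every stored vector against the query and return the top-k row ids."""
    score_fn, higher_is_better = METRICS[metric]
    scored = [(score_fn(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: s[0], reverse=higher_is_better)
    return [i for _, i in scored[:k]]
```

Every stored vector is scored, which is why recall is exact and why the cost grows linearly with collection size. The choice of metric can also change the ranking: a long vector may win under IP while a shorter, better-aligned one wins under COSINE.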
HnswIndexParam
- Purpose: Approximate nearest neighbor search using hierarchical navigable small world graphs
- Best for: High-recall, low-latency applications
- Key Parameters:
  - m: Number of bi-directional links for each node (typically 8-32)
  - ef_construction: Size of dynamic candidate list during construction (higher = better quality)
  - metric_type: Distance metric
IVFIndexParam
- Purpose: Inverted file index with optional SOAR acceleration for large-scale datasets
- Best for: Billion-scale vector search with memory efficiency
- Key Parameters:
  - n_list: Number of coarse centroids (typically 4*sqrt(n) for n vectors)
  - n_iters: Number of k-means refinement iterations
  - use_soar: Boolean to enable SOAR acceleration
  - metric_type: Distance metric
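The 4*sqrt(n) rule of thumb for n_list is easy to turn into a small helper. `suggest_n_list` is a hypothetical function written for this example, not part of ZVec, and the result is only a starting point to tune against your own recall and latency measurements.

```python
import math


def suggest_n_list(num_vectors: int, factor: float = 4.0) -> int:
    """Return factor * sqrt(num_vectors), rounded, with a floor of 1."""
    if num_vectors < 0:
        raise ValueError("num_vectors must be non-negative")
    return max(1, round(factor * math.sqrt(num_vectors)))
```

For example, 65,536 vectors give 4 * 256 = 1024 centroids, and one million vectors give 4 * 1000 = 4000.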
Creating Index Parameters in Python
Here is a comprehensive example showing all index parameter types:
from zvec.model.param import (
    InvertIndexParam,
    FlatIndexParam,
    HnswIndexParam,
    IVFIndexParam,
)
from zvec.typing import MetricType

# Scalar inverted index
scalar_idx = InvertIndexParam(
    enable_range_optimization=True,
    enable_extended_wildcard=False,
)

# Flat index for exact search
flat_idx = FlatIndexParam(metric_type=MetricType.IP)

# HNSW index for high-performance ANN
hnsw_idx = HnswIndexParam(
    m=16,
    ef_construction=200,
    metric_type=MetricType.COSINE,
)

# IVF index for large-scale search
ivf_idx = IVFIndexParam(
    metric_type=MetricType.L2,
    n_list=1024,
    n_iters=10,
    use_soar=False,
)
How ZVec Applies Schema and Index Configurations
ZVec enforces schema and index configurations at multiple stages of the collection lifecycle, as implemented in python/zvec/zvec.py and the underlying C++ core.
Collection Creation
When calling zvec.Collection.create_and_open, the supplied CollectionSchema is passed to the C++ core via _Collection.CreateAndOpen. The core performs the following operations:
- Stores the schema metadata persistently
- Builds any requested vector indexes immediately (HNSW, IVF, or Flat)
- Registers inverted indexes for scalar fields that specify InvertIndexParam
Dynamic Schema Modifications
ZVec supports schema evolution through the Collection.add_column method. When adding a new column, you can include an index_param argument, and ZVec will instantiate the proper index backend for that column immediately.
Index Management
The Collection.create_index and Collection.drop_index methods allow runtime index modifications. These methods validate that the supplied index_param matches the field type (scalar vs. vector) before invoking the underlying C++ CreateIndex or DropIndex methods. The concrete index objects (HnswIndexParams, IVFIndexParams, FlatIndexParams, InvertIndexParams) are instantiated in the C++ layer based on these Python configurations.
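The scalar-versus-vector check described above can be pictured with a minimal sketch. The `Sketch*` classes and `check_index_param` are hypothetical stand-ins defined here so the example runs without ZVec installed; the actual validation logic in python/zvec/zvec.py is not reproduced and may be structured differently.

```python
class SketchInvertIndexParam:
    """Hypothetical stand-in for a scalar index parameter."""


class SketchFlatIndexParam:
    """Hypothetical stand-in for a vector index parameter."""


class SketchHnswIndexParam:
    """Hypothetical stand-in for a vector index parameter."""


class SketchIVFIndexParam:
    """Hypothetical stand-in for a vector index parameter."""


SCALAR_PARAMS = (SketchInvertIndexParam,)
VECTOR_PARAMS = (SketchFlatIndexParam, SketchHnswIndexParam, SketchIVFIndexParam)


def check_index_param(is_vector_field: bool, index_param) -> None:
    """Raise TypeError when the parameter family does not match the field kind."""
    allowed = VECTOR_PARAMS if is_vector_field else SCALAR_PARAMS
    if not isinstance(index_param, allowed):
        kind = "vector" if is_vector_field else "scalar"
        raise TypeError(
            f"{type(index_param).__name__} is not valid for a {kind} field"
        )
```

Rejecting a mismatched parameter on the Python side keeps the error close to the caller, before any call crosses into the C++ layer.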
Key Source Files
The implementation of ZVec collection schemas and index parameters spans both the Python and C++ layers:
| Path | Role |
|---|---|
| python/zvec/model/schema/collection_schema.py | Collection-level schema definition and field name validation |
| python/zvec/model/schema/field_schema.py | Scalar FieldSchema and VectorSchema implementations |
| python/zvec/model/param/__init__.py | Python façade exposing C++ index parameter classes |
| src/include/zvec/db/index_params.h | C++ struct definitions for all index parameters |
| python/zvec/zvec.py | High-level API methods (create_and_open, add_column, create_index) |
| python/tests/test_schema.py | Unit tests for schema construction validation |
| python/tests/test_params.py | Unit tests for index parameter instantiation |
Summary
- CollectionSchema acts as the top-level container in python/zvec/model/schema/collection_schema.py, enforcing unique field names across scalar and vector fields.
- FieldSchema defines scalar columns with optional InvertIndexParam for range and wildcard optimization.
- VectorSchema defines embedding columns with required vector index parameters chosen from FlatIndexParam, HnswIndexParam, or IVFIndexParam.
- Index parameters are defined in C++ at src/include/zvec/db/index_params.h and exposed through zvec.model.param, controlling exact search, graph-based ANN, or inverted file indexing.
- Lifecycle integration occurs through create_and_open, add_column, and create_index methods in python/zvec/zvec.py, which validate configurations and instantiate concrete C++ index objects.
Frequently Asked Questions
What is the difference between FieldSchema and VectorSchema in ZVec?
FieldSchema defines scalar (non-vector) columns such as integers, strings, or timestamps, and optionally accepts an InvertIndexParam for range queries. VectorSchema specifically defines high-dimensional embedding columns, requires a dimensionality parameter, and mandates a vector index parameter (FlatIndexParam, HnswIndexParam, or IVFIndexParam) to enable similarity search.
How do I choose between HNSW and IVF index parameters in ZVec?
Choose HnswIndexParam when you require sub-millisecond latency with high recall on datasets ranging from thousands to hundreds of millions of vectors, as it builds a navigable small-world graph optimized for approximate nearest neighbor search. Choose IVFIndexParam for billion-scale datasets where memory efficiency is critical, as it uses inverted file indexing with coarse quantization; enable the use_soar flag for additional acceleration on large partitions.
Can I modify a collection schema after creating the collection?
Yes, ZVec supports schema evolution through the Collection.add_column method, which allows you to add new scalar or vector fields after initial creation. When adding a column, you can specify an index_param to immediately build the appropriate index backend. However, existing field definitions cannot be altered or removed; you can only add new columns or create/drop indexes on existing columns using create_index and drop_index.
What file contains the C++ definitions for ZVec index parameters?
The C++ struct definitions for all index parameters are located in src/include/zvec/db/index_params.h. This header defines InvertIndexParam, FlatIndexParam, HnswIndexParam, and IVFIndexParam structures that mirror the Python classes exposed through zvec.model.param. The Python layer in python/zvec/model/param/__init__.py acts as a façade that forwards configuration values to these underlying C++ implementations.