ZVec Collection Schema and Index Parameter Configurations: A Complete Guide

ZVec collection schemas define scalar fields via FieldSchema and vector fields via VectorSchema, while index parameters such as InvertIndexParam, HnswIndexParam, and IVFIndexParam control how each field is indexed for optimized search performance.

ZVec, Alibaba’s high-performance vector database, organizes data into collections governed by structured schemas. The schema and its index parameters determine which data types a collection holds, how similarity search runs, and how queries perform across both scalar and vector fields.

Understanding ZVec Collection Schema Structure

The CollectionSchema class serves as the top-level container that defines the structure of a ZVec collection. According to the source code in python/zvec/model/schema/collection_schema.py, this class holds the collection name, a list of scalar FieldSchema objects, and a list of VectorSchema objects. It automatically validates the uniqueness of field names across both scalar and vector fields to prevent naming collisions.
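The uniqueness check described above can be illustrated without the library. The following is a minimal sketch in plain Python, not ZVec's actual implementation: the helper names find_duplicate_names and validate_unique_names are hypothetical, and the real validation in collection_schema.py may differ in its error type and message.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_names(scalar_names: Iterable[str], vector_names: Iterable[str]) -> List[str]:
    """Return the names used more than once across both groups, sorted."""
    counts = Counter(list(scalar_names) + list(vector_names))
    return sorted(name for name, seen in counts.items() if seen > 1)


def validate_unique_names(scalar_names: Iterable[str], vector_names: Iterable[str]) -> None:
    """Raise ValueError when a name is reused, even across the two groups.

    Hypothetical stand-in for the check a schema container would run.
    """
    duplicates = find_duplicate_names(scalar_names, vector_names)
    if duplicates:
        raise ValueError(f"duplicate field names: {duplicates}")


# A scalar field and a vector field may not share a name.
print(find_duplicate_names(["doc_id", "embedding"], ["embedding"]))
```

The point of merging both lists before counting is that a collision between a scalar field and a vector field is caught the same way as a collision within one group.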

FieldSchema for Scalar Fields

Scalar fields are defined using the FieldSchema class, located in python/zvec/model/schema/field_schema.py. Each scalar field description includes:

  • Name: The column identifier
  • DataType: The scalar data type (e.g., INT64, STRING)
  • Nullable: Boolean indicating whether null values are permitted
  • Index Parameter: Optional InvertIndexParam for enabling inverted index optimizations

The InvertIndexParam supports range queries and wildcard search on scalar columns, with options like enable_range_optimization and enable_extended_wildcard.

VectorSchema for Vector Fields

Vector fields use the VectorSchema class (also in python/zvec/model/schema/field_schema.py) to define high-dimensional embeddings. Each vector schema specifies:

  • Name: The vector column identifier
  • DataType: Vector data type (e.g., VECTOR_FP32)
  • Dimension: The vector dimensionality (e.g., 128, 768)
  • Index Parameter: Required vector index configuration (FlatIndexParam, HnswIndexParam, or IVFIndexParam)

Building a Collection Schema in Python

Here is a complete example demonstrating how to construct a collection schema with both scalar and vector fields:

from zvec import CollectionSchema, FieldSchema, VectorSchema
from zvec.typing import DataType, MetricType
from zvec.model.param import InvertIndexParam, HnswIndexParam

# Define scalar field with inverted index for range queries
doc_id_field = FieldSchema(
    name="doc_id",
    data_type=DataType.INT64,
    nullable=False,
    index_param=InvertIndexParam(enable_range_optimization=True),
)

# Define vector field with HNSW index for approximate search
embedding_field = VectorSchema(
    name="embedding",
    data_type=DataType.VECTOR_FP32,
    dimension=128,
    index_param=HnswIndexParam(m=16, ef_construction=200, metric_type=MetricType.COSINE),
)

# Assemble the collection schema
schema = CollectionSchema(
    name="document_collection",
    fields=[doc_id_field],
    vectors=[embedding_field],
)

The constructor automatically validates that field names are unique across both fields and vectors, and that index parameters match their respective field types.

Index Parameter Configurations in ZVec

ZVec defines four index parameter classes in two families, one for scalar fields and one for vector fields. They are implemented in C++ in src/include/zvec/db/index_params.h and exposed to Python through zvec.model.param. These configurations determine how data is indexed and queried.

Scalar Index Parameters

For scalar fields, ZVec provides the InvertIndexParam class:

  • Purpose: Optimizes range queries and wildcard searches on non-vector columns
  • Key Parameters:
    • enable_range_optimization: Boolean to enable optimized range filtering
    • enable_extended_wildcard: Boolean to support advanced wildcard patterns

Vector Index Parameters

ZVec supports three vector index strategies, each with distinct performance characteristics:

FlatIndexParam

  • Purpose: Brute-force exact search with 100% recall
  • Best for: Small datasets or when exact results are mandatory
  • Parameters: metric_type (IP, L2, COSINE)
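To make "brute-force exact search" concrete, the sketch below scores every stored vector against the query and keeps the best k. It is a plain-Python illustration, not ZVec code: the function names are hypothetical, and the strings "IP", "L2", and "COSINE" merely stand in for the MetricType values.

```python
import math
from typing import List, Sequence


def similarity(query: Sequence[float], vector: Sequence[float], metric: str) -> float:
    """Return a score where a higher value always means a closer match."""
    dot = sum(q * v for q, v in zip(query, vector))
    if metric == "IP":
        return dot
    if metric == "COSINE":
        norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(v * v for v in vector))
        return dot / norm if norm else 0.0
    if metric == "L2":
        # Negate the distance so that smaller distances rank higher.
        return -math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vector)))
    raise ValueError(f"unknown metric: {metric}")


def flat_search(query: Sequence[float], vectors: Sequence[Sequence[float]],
                k: int, metric: str = "L2") -> List[int]:
    """Compare the query with every vector and return the indices of the top k."""
    scored = [(similarity(query, v, metric), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]


vectors = [[1, 0], [0, 1], [3, 0], [0.9, 0.1]]
print(flat_search([1, 0], vectors, k=2, metric="L2"))
print(flat_search([1, 0], vectors, k=2, metric="IP"))
```

Because every vector is scored, the result is exact, but the cost grows linearly with the collection size, which is why this strategy suits small datasets.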

HnswIndexParam

  • Purpose: Approximate nearest neighbor search using hierarchical navigable small world graphs
  • Best for: High-recall, low-latency applications
  • Key Parameters:
    • m: Number of bi-directional links for each node (typically 8-32)
    • ef_construction: Size of dynamic candidate list during construction (higher = better quality)
    • metric_type: Distance metric
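The m parameter also drives memory use, since every node stores its neighbor links. As a rough illustration, the sketch below applies the rule of thumb published for the open-source hnswlib library (about d * 4 + m * 2 * 4 bytes per stored float32 vector). Whether ZVec's HNSW implementation uses the same layout is an assumption, and the function name is hypothetical.

```python
def estimate_hnsw_bytes(num_vectors: int, dimension: int, m: int) -> int:
    """Rough memory estimate in bytes for float32 vectors plus graph links.

    Uses the hnswlib rule of thumb; treat the result as an order-of-magnitude
    guide, not a measurement of any particular engine.
    """
    if num_vectors < 0 or dimension <= 0 or m <= 0:
        raise ValueError("num_vectors must be >= 0; dimension and m must be > 0")
    vector_bytes = dimension * 4  # 4 bytes per float32 component
    link_bytes = m * 2 * 4        # up to 2*m neighbor ids of 4 bytes each
    return num_vectors * (vector_bytes + link_bytes)


# One million 128-dimensional vectors with m=16 versus m=32.
print(estimate_hnsw_bytes(1_000_000, 128, 16))
print(estimate_hnsw_bytes(1_000_000, 128, 32))
```

Under this estimate, doubling m from 16 to 32 adds 128 bytes per vector, so the graph overhead matters most when the vectors themselves are low-dimensional.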

IVFIndexParam

  • Purpose: Inverted file index with optional SOAR acceleration for large-scale datasets
  • Best for: Billion-scale vector search with memory efficiency
  • Key Parameters:
    • n_list: Number of coarse centroids (typically 4*sqrt(n) for n vectors)
    • n_iters: Number of k-means refinement iterations
    • use_soar: Boolean to enable SOAR acceleration
    • metric_type: Distance metric
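The 4*sqrt(n) guideline for n_list is easy to turn into a small helper. This is a minimal sketch of that rule of thumb only: the function name suggest_n_list and the adjustable factor argument are hypothetical and not part of ZVec.

```python
import math


def suggest_n_list(num_vectors: int, factor: float = 4.0) -> int:
    """Suggest a number of coarse centroids as factor * sqrt(num_vectors)."""
    if num_vectors <= 0:
        raise ValueError("num_vectors must be positive")
    return max(1, round(factor * math.sqrt(num_vectors)))


print(suggest_n_list(65_536))     # 1024
print(suggest_n_list(1_000_000))  # 4000
```

The value is a starting point for tuning: more centroids mean smaller partitions to scan per query but a more expensive k-means build.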

Creating Index Parameters in Python

Here is a comprehensive example showing all index parameter types:

from zvec.model.param import (
    InvertIndexParam,
    FlatIndexParam,
    HnswIndexParam,
    IVFIndexParam,
)
from zvec.typing import MetricType

# Scalar inverted index
scalar_idx = InvertIndexParam(
    enable_range_optimization=True,
    enable_extended_wildcard=False,
)

# Flat index for exact search
flat_idx = FlatIndexParam(metric_type=MetricType.IP)

# HNSW index for high-performance ANN
hnsw_idx = HnswIndexParam(
    m=16,
    ef_construction=200,
    metric_type=MetricType.COSINE,
)

# IVF index for large-scale search
ivf_idx = IVFIndexParam(
    metric_type=MetricType.L2,
    n_list=1024,
    n_iters=10,
    use_soar=False,
)

How ZVec Applies Schema and Index Configurations

ZVec enforces schema and index configurations at multiple stages of the collection lifecycle, as implemented in python/zvec/zvec.py and the underlying C++ core.

Collection Creation

When calling zvec.Collection.create_and_open, the supplied CollectionSchema is passed to the C++ core via _Collection.CreateAndOpen. The core performs the following operations:

  1. Stores the schema metadata persistently
  2. Builds any requested vector indexes immediately (HNSW, IVF, or Flat)
  3. Registers inverted indexes for scalar fields that specify InvertIndexParam

Dynamic Schema Modifications

ZVec supports schema evolution through the Collection.add_column method. When adding a new column, you can include an index_param argument, and ZVec will instantiate the proper index backend for that column immediately.

Index Management

The Collection.create_index and Collection.drop_index methods allow runtime index modifications. These methods validate that the supplied index_param matches the field type (scalar vs. vector) before invoking the underlying C++ CreateIndex or DropIndex methods. The concrete index objects (HnswIndexParams, IVFIndexParams, FlatIndexParams, InvertIndexParams) are instantiated in the C++ layer based on these Python configurations.
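The scalar-versus-vector check described above amounts to looking up which index kinds a field may accept. The sketch below shows that idea in plain Python under stated assumptions: the kind strings and the function check_index_matches_field are hypothetical stand-ins, and ZVec's own validation may be structured differently.

```python
# Hypothetical labels standing in for the four parameter classes.
SCALAR_INDEX_KINDS = {"invert"}
VECTOR_INDEX_KINDS = {"flat", "hnsw", "ivf"}


def check_index_matches_field(field_is_vector: bool, index_kind: str) -> None:
    """Raise ValueError when an index kind does not fit the field category."""
    allowed = VECTOR_INDEX_KINDS if field_is_vector else SCALAR_INDEX_KINDS
    if index_kind not in allowed:
        category = "vector" if field_is_vector else "scalar"
        raise ValueError(
            f"index kind {index_kind!r} is not valid for a {category} field; "
            f"expected one of {sorted(allowed)}"
        )


check_index_matches_field(True, "hnsw")     # accepted: vector field, vector index
check_index_matches_field(False, "invert")  # accepted: scalar field, inverted index
```

Running the check before any work reaches the native layer means a mismatched configuration fails fast with a readable message.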

Key Source Files

The implementation of collection schemas and index parameters spans both the Python and C++ layers:

| Path | Role |
| --- | --- |
| python/zvec/model/schema/collection_schema.py | Collection-level schema definition and field name validation |
| python/zvec/model/schema/field_schema.py | Scalar FieldSchema and VectorSchema implementations |
| python/zvec/model/param/__init__.py | Python façade exposing C++ index parameter classes |
| src/include/zvec/db/index_params.h | C++ struct definitions for all index parameters |
| python/zvec/zvec.py | High-level API methods (create_and_open, add_column, create_index) |
| python/tests/test_schema.py | Unit tests for schema construction validation |
| python/tests/test_params.py | Unit tests for index parameter instantiation |

Summary

  • CollectionSchema acts as the top-level container in python/zvec/model/schema/collection_schema.py, enforcing unique field names across scalar and vector fields.
  • FieldSchema defines scalar columns with optional InvertIndexParam for range and wildcard optimization.
  • VectorSchema defines embedding columns with required vector index parameters chosen from FlatIndexParam, HnswIndexParam, or IVFIndexParam.
  • Index parameters are defined in C++ at src/include/zvec/db/index_params.h and exposed through zvec.model.param, controlling exact search, graph-based ANN, or inverted file indexing.
  • Lifecycle integration occurs through create_and_open, add_column, and create_index methods in python/zvec/zvec.py, which validate configurations and instantiate concrete C++ index objects.

Frequently Asked Questions

What is the difference between FieldSchema and VectorSchema in ZVec?

FieldSchema defines scalar (non-vector) columns such as integers, strings, or timestamps, and optionally accepts an InvertIndexParam for range queries. VectorSchema specifically defines high-dimensional embedding columns, requires a dimensionality parameter, and mandates a vector index parameter (FlatIndexParam, HnswIndexParam, or IVFIndexParam) to enable similarity search.

How do I choose between HNSW and IVF index parameters in ZVec?

Choose HnswIndexParam when you need low latency with high recall, as it builds a navigable small-world graph optimized for approximate nearest neighbor search. Choose IVFIndexParam for very large datasets, up to billion scale, where memory efficiency is critical, as it partitions vectors around coarse centroids using an inverted file index; the use_soar flag enables optional SOAR acceleration.

Can I modify a collection schema after creating the collection?

Yes, ZVec supports schema evolution through the Collection.add_column method, which allows you to add new scalar or vector fields after initial creation. When adding a column, you can specify an index_param to immediately build the appropriate index backend. However, existing field definitions cannot be altered or removed; you can only add new columns or create/drop indexes on existing columns using create_index and drop_index.

What file contains the C++ definitions for ZVec index parameters?

The C++ struct definitions for all index parameters are located in src/include/zvec/db/index_params.h. This header defines InvertIndexParam, FlatIndexParam, HnswIndexParam, and IVFIndexParam structures that mirror the Python classes exposed through zvec.model.param. The Python layer in python/zvec/model/param/__init__.py acts as a façade that forwards configuration values to these underlying C++ implementations.
