How to Optimize ZVec Index Performance with Merge and Optimize: A Complete Guide

Use CollectionImpl::Optimize for automatic collection-wide compaction that selects high-delete-ratio segments and rebuilds indexes, or call VectorColumnIndexer::Merge directly for low-level control over combining specific segment indexes with custom concurrency settings.

ZVec, Alibaba's high-performance vector database, stores vectors in segment files where each segment maintains its own Proxima engine index. Over time, accumulated small segments and deleted documents degrade query latency and increase memory pressure. This guide explains how to improve ZVec index performance with the Merge and Optimize operations, based on the implementation in the alibaba/zvec repository.

Understanding ZVec's Segment Architecture

ZVec persists vectors in immutable segment files. Each segment contains its own vector index implemented by the Proxima engine, along with scalar indexes and metadata. As documents are inserted and deleted, a collection accumulates many small segments and tombstone entries, which forces queries to scan multiple indexes and increases memory overhead. The codebase provides two complementary mechanisms to consolidate this data: Merge for engine-level index operations and Optimize for collection-wide compaction orchestration.
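To make the cost of fragmentation concrete, here is a minimal Python sketch of a query fanning out across segments. It is a toy model, not ZVec code: ToySegment, search_collection, and the one-dimensional "vectors" are all hypothetical names and simplifications introduced for illustration, and the brute-force scan stands in for a real Proxima index lookup.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ToySegment:
    """Hypothetical stand-in for one immutable segment: doc_id -> 1-D vector value."""
    docs: dict
    deleted: set = field(default_factory=set)

    def search(self, query, k):
        # Brute-force scan; tombstoned docs are still iterated over, then skipped.
        hits = [(abs(value - query), doc_id)
                for doc_id, value in self.docs.items()
                if doc_id not in self.deleted]
        return heapq.nsmallest(k, hits)


def search_collection(segments, query, k):
    # Every segment is probed, then the per-segment results are merged into one top-k.
    partial = [hit for seg in segments for hit in seg.search(query, k)]
    return [doc_id for _, doc_id in heapq.nsmallest(k, partial)]


# Three small segments with tombstones versus one compacted segment.
fragmented = [
    ToySegment({1: 0.1, 2: 0.9}, deleted={2}),
    ToySegment({3: 0.2}),
    ToySegment({4: 0.8, 5: 0.3}, deleted={4}),
]
compacted = [ToySegment({1: 0.1, 3: 0.2, 5: 0.3})]

result_fragmented = search_collection(fragmented, 0.0, 2)
result_compacted = search_collection(compacted, 0.0, 2)
```

Both layouts return the same documents, but the fragmented layout probes three indexes and walks past two tombstones to get there. That extra work is what compaction removes.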

The Merge Mechanism: Low-Level Index Consolidation

VectorColumnIndexer::Merge is the primitive operation that combines the Proxima engine indexes of several source segments into a single target index. This function lives in src/db/index/column/vector_column/vector_column_indexer.cc at lines 101-133.

VectorColumnIndexer::Merge Implementation

The merge process begins when a target VectorColumnIndexer receives a list of source indexers, an optional IndexFilter, and a MergeOptions struct:

Status VectorColumnIndexer::Merge(
    const std::vector<VectorColumnIndexer::Ptr> &indexers,
    const IndexFilter::Ptr &filter,
    const vector_column_params::MergeOptions &merge_options);

Internally, the function unwraps each VectorColumnIndexer to extract its underlying Proxima engine pointer (engine_indexers). The user-provided IndexFilter is converted to a Proxima-compatible filter (engine_filter) to skip deleted documents during the merge. The actual work is delegated to the Proxima engine:

index->Merge(engine_indexers, *engine_filter,
             {merge_options.write_concurrency, merge_options.pool});
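The shape of this operation (several sources, one target, a filter that drops deleted documents) can be sketched in a few lines of Python. This is a simplified model under stated assumptions: ToyIndexer and its merge method are hypothetical, vectors are simply copied into a dictionary, and nothing here reflects how the Proxima engine actually rebuilds its index structures.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndexer:
    """Hypothetical stand-in for a VectorColumnIndexer: doc_id -> vector."""
    vectors: dict = field(default_factory=dict)

    def merge(self, sources, is_filtered=None):
        """Copy every vector from `sources` into this indexer, skipping filtered docs.

        Returns the number of vectors held by the target after the merge.
        """
        for src in sources:
            for doc_id, vec in src.vectors.items():
                if is_filtered is not None and is_filtered(doc_id):
                    continue  # deleted document: leave it out of the merged index
                self.vectors[doc_id] = vec
        return len(self.vectors)


deleted = {2, 5}
seg_a = ToyIndexer({1: [0.1, 0.2], 2: [0.3, 0.4]})
seg_b = ToyIndexer({5: [0.5, 0.6], 7: [0.7, 0.8]})

target = ToyIndexer()
count = target.merge([seg_a, seg_b], is_filtered=lambda doc_id: doc_id in deleted)
```

The filter is applied while the sources are read, so deleted documents never reach the target. That is the role the text above assigns to the converted engine_filter.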

Configuring MergeOptions

The MergeOptions struct, defined in src/include/zvec/db/options.h, controls concurrency behavior:

struct MergeOptions {
    uint32_t write_concurrency{1};
    ailego::ThreadPool *pool{nullptr};
};

  • write_concurrency: When set greater than 0, the Proxima engine parallelizes the merge across that many writer threads.
  • pool: When non-null, the merge executes inside the supplied ailego::ThreadPool, allowing integration with ZVec's global optimize thread pool or custom external pools.
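One way to picture how the two fields interact is the following Python sketch, which uses concurrent.futures.ThreadPoolExecutor as a stand-in for ailego::ThreadPool. ToyMergeOptions and run_merge are hypothetical, and the rule that a supplied pool takes precedence over write_concurrency is an assumption drawn from the descriptions above, not from the Proxima engine source.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMergeOptions:
    """Hypothetical mirror of MergeOptions, with the same defaults as the struct above."""
    write_concurrency: int = 1
    pool: Optional[ThreadPoolExecutor] = None


def run_merge(chunks, options):
    """Process each chunk (here: just sum it) on the executor the options imply."""
    if options.pool is not None:
        # Caller-supplied pool: reuse it and leave its lifetime to the caller.
        return list(options.pool.map(sum, chunks))
    # No pool: create a private executor sized by write_concurrency.
    with ThreadPoolExecutor(max_workers=max(1, options.write_concurrency)) as executor:
        return list(executor.map(sum, chunks))


chunks = [[1, 2], [3, 4], [5, 6]]

# Private executor with two writer threads.
private = run_merge(chunks, ToyMergeOptions(write_concurrency=2))

# Shared pool owned by the caller.
shared_pool = ThreadPoolExecutor(max_workers=4)
shared = run_merge(chunks, ToyMergeOptions(pool=shared_pool))
shared_pool.shutdown()
```

The design point is ownership: a private executor is created and torn down per merge, while a shared pool lets several merges compete for one bounded set of threads.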

The Optimize Workflow: Collection-Wide Compaction

CollectionImpl::Optimize provides the high-level orchestration for collection-wide index maintenance. This method, located in src/db/collection.cc at lines 686-721, automates segment selection, merge execution, and metadata updates.

CollectionImpl::Optimize Orchestration

The optimize workflow follows a strict sequence to ensure consistency:

  1. Locking: Acquires collection-wide write locks to prevent concurrent schema changes and writes.

  2. Segment Selection: Gathers all persisted (read-only) segments. If the active writing segment contains data, it is flushed first.

  3. Task Building: build_compact_task analyzes delete ratios against COMPACT_DELETE_RATIO_THRESHOLD to determine which segments need rebuilding. It creates CreateVectorIndexTask instances for each vector column.

  4. Merge Options Setup: Inside SegmentHelper::CreateVectorIndexTask (src/db/index/segment/segment_helper.cc, lines 623-639), the system populates MergeOptions:

    vector_column_params::MergeOptions merge_options;
    if (concurrency == 0) {
        merge_options.pool = GlobalResource::Instance().optimize_thread_pool();
    } else {
        merge_options.write_concurrency = concurrency;
    }

  5. Execution: Tasks run in parallel using the thread pool. After each vector index merge, the new segment is flushed to disk.

  6. Version Update: The version manager atomically swaps old segment metadata with newly built segments.

  7. Garbage Collection: Old segments are destroyed and their on-disk files are removed.
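The seven steps above can be sketched as a small Python model. Everything in it is hypothetical (ToyCollection, ToySegment, the needs_rebuild callback, and the choice to fold all selected segments into a single rebuilt segment). It only illustrates the ordering of lock, flush, select, rebuild, swap, and cleanup, not the actual logic in src/db/collection.cc.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ToySegment:
    seg_id: int
    docs: dict
    deleted: set = field(default_factory=set)


class ToyCollection:
    def __init__(self, persisted, writing):
        self.lock = threading.Lock()
        self.version = list(persisted)  # segments currently visible to readers
        self.writing = writing          # active writing segment
        self.removed_files = []         # ids of garbage-collected segments
        self.next_id = 1 + max(s.seg_id for s in self.version + [writing])

    def _new_id(self):
        seg_id, self.next_id = self.next_id, self.next_id + 1
        return seg_id

    def optimize(self, needs_rebuild):
        with self.lock:                                    # 1. locking
            if self.writing.docs:                          # 2. flush the writing segment
                self.version.append(self.writing)
                self.writing = ToySegment(self._new_id(), {})
            victims = [s for s in self.version if needs_rebuild(s)]  # 3. task building
            if not victims:
                return 0
            # 4-5. merge only the live documents of the selected segments
            live = {doc_id: value
                    for seg in victims
                    for doc_id, value in seg.docs.items()
                    if doc_id not in seg.deleted}
            rebuilt = ToySegment(self._new_id(), live)
            victim_ids = {s.seg_id for s in victims}
            # 6. version update: publish the new segment list in one assignment
            self.version = [s for s in self.version
                            if s.seg_id not in victim_ids] + [rebuilt]
            # 7. garbage collection of the replaced segments
            self.removed_files.extend(sorted(victim_ids))
            return len(victims)


coll = ToyCollection(
    persisted=[ToySegment(1, {10: "a", 11: "b"}, deleted={10}),
               ToySegment(2, {20: "c"})],
    writing=ToySegment(3, {30: "d"}),
)
rebuilt_count = coll.optimize(lambda seg: len(seg.deleted) > 0)
```

Cleanup runs only after the new segment list has been published, so at no point does the visible version reference a segment whose files are already gone.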

Segment Selection and Delete Ratios

The optimizer specifically targets segments with high delete ratios. When the proportion of deleted documents in a segment exceeds COMPACT_DELETE_RATIO_THRESHOLD, the segment is flagged for rebuilding. This threshold ensures that optimize operations focus on segments where compaction will yield the greatest performance improvement.
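As a rough illustration of the selection rule, the Python sketch below computes a delete ratio per segment and keeps those above a threshold. SegmentStats and select_for_rebuild are hypothetical, and the 0.3 threshold is an assumed placeholder: the real value of COMPACT_DELETE_RATIO_THRESHOLD is defined in the zvec source and is not stated here.

```python
from dataclasses import dataclass

# Assumed placeholder value, NOT the real COMPACT_DELETE_RATIO_THRESHOLD.
ASSUMED_DELETE_RATIO_THRESHOLD = 0.3


@dataclass
class SegmentStats:
    seg_id: int
    total_docs: int
    deleted_docs: int

    @property
    def delete_ratio(self):
        # Guard against empty segments to avoid division by zero.
        return self.deleted_docs / self.total_docs if self.total_docs else 0.0


def select_for_rebuild(stats, threshold=ASSUMED_DELETE_RATIO_THRESHOLD):
    # "Exceeds" is modeled as a strict comparison.
    return [s.seg_id for s in stats if s.delete_ratio > threshold]


stats = [
    SegmentStats(1, total_docs=1000, deleted_docs=50),   # 5% deleted
    SegmentStats(2, total_docs=1000, deleted_docs=450),  # 45% deleted
    SegmentStats(3, total_docs=0, deleted_docs=0),       # empty segment
]
selected = select_for_rebuild(stats)
```

With these numbers only segment 2 is selected. Segment 1 is left alone because rewriting it would reclaim little space for the cost of a full rebuild.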

When to Use Merge vs Optimize

Choose the appropriate mechanism based on your operational requirements:

  • Frequent small writes: When your collection accumulates many tiny segments and delete ratios grow, call collection.Optimize() to compact everything automatically.
  • Control over specific fields: When you need to merge only specific vector indexes (e.g., after bulk importing into a single field), use VectorColumnIndexer::Merge directly with custom MergeOptions.
  • Custom thread pools: When you have an external thread pool you want to reuse, populate MergeOptions.pool with your ailego::ThreadPool* and pass it to Merge.

Code Examples

C++ Simple Optimize

#include <zvec/db/collection.h>
#include <zvec/db/options.h>

int main() {
    // Open an existing collection
    auto coll = zvec::Collection::Open("my_collection");
    
    // Optimize using the global optimize thread-pool (concurrency=0)
    zvec::OptimizeOptions opt{0};  // 0 => use GlobalResource::optimize_thread_pool()
    auto status = coll->Optimize(opt);
    
    if (!status.ok()) {
        LOG_ERROR("Optimize failed: %s", status.message().c_str());
    }
    return 0;
}

C++ Manual Merge

#include <memory>

#include <zvec/db/index/column/vector_column/vector_column_indexer.h>
#include <zvec/db/index/common/index_filter.h>

using namespace zvec;

int main() {
    // Assume field_schema (the schema of the vector field) is defined elsewhere
    // and that the source index already exists on disk.
    auto target = std::make_shared<VectorColumnIndexer>("path/to/target.idx", field_schema);
    auto src = std::make_shared<VectorColumnIndexer>("path/to/src.idx", field_schema);

    target->Open({true, true});   // mmap, create_new=true: start from an empty target
    src->Open({true, false});     // mmap, create_new=false: open the existing source as-is

    // Optional: filter out deleted doc IDs
    auto filter = std::make_shared<IndexFilter>();
    // ... fill filter ...

    // Merge with 4 writer threads
    vector_column_params::MergeOptions opts;
    opts.write_concurrency = 4;  // or opts.pool = your_thread_pool;
    auto s = target->Merge({src}, filter, opts);
    
    if (!s.ok()) {
        LOG_ERROR("Merge failed: %s", s.message().c_str());
    }
}

Python Optimize Collection

from zvec import Collection, OptimizeOption

# Open an existing collection
coll = Collection.open("my_collection")

# Optimize using the default thread pool (concurrency=0)
coll.optimize(OptimizeOption())  # equivalent to OptimizeOption(concurrency=0)

# Or limit to 2 threads
coll.optimize(OptimizeOption(concurrency=2))

Key Source Files

| File | Role | Direct Link |
| --- | --- | --- |
| src/db/index/column/vector_column/vector_column_indexer.cc | Implements VectorColumnIndexer::Merge, the low-level merge primitive. | https://github.com/alibaba/zvec/blob/main/src/db/index/column/vector_column/vector_column_indexer.cc |
| src/db/index/segment/segment_helper.cc | Constructs MergeOptions for each column during optimization. | https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment_helper.cc |
| src/db/collection.cc | High-level CollectionImpl::Optimize driving segment compaction. | https://github.com/alibaba/zvec/blob/main/src/db/collection.cc |
| src/include/zvec/db/options.h | Definition of OptimizeOptions and MergeOptions. | https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/options.h |
| python/zvec/model/collection.py | Python wrapper exposing Collection.optimize. | https://github.com/alibaba/zvec/blob/main/python/zvec/model/collection.py |
| python/zvec/model/param/__init__.py | Python OptimizeOption dataclass definition. | https://github.com/alibaba/zvec/blob/main/python/zvec/model/param/__init__.py |

Summary

  • ZVec stores vectors in segment files with individual Proxima engine indexes, leading to fragmentation over time.
  • Merge (VectorColumnIndexer::Merge) is the engine-level primitive that combines indexes from multiple segments, supporting concurrent writes via MergeOptions.
  • Optimize (CollectionImpl::Optimize) orchestrates collection-wide compaction, automatically selecting segments with high delete ratios and invoking Merge for each vector column.
  • Configure concurrency through write_concurrency (thread count) or pool (custom thread pool) in MergeOptions.
  • Use Optimize for routine maintenance after bulk inserts or high deletion rates; use Merge directly when you need fine-grained control over specific vector columns.

Frequently Asked Questions

What is the difference between Merge and Optimize in ZVec?

Merge is a low-level operation implemented in VectorColumnIndexer::Merge that combines the Proxima engine indexes of specific source segments into a single target index. Optimize is a high-level collection management operation implemented in CollectionImpl::Optimize that orchestrates the entire compaction process: it selects segments with high delete ratios, creates new segments, invokes Merge for each vector column, updates metadata, and deletes old files.

How does concurrency affect ZVec index merging?

Concurrency is controlled through the MergeOptions struct passed to VectorColumnIndexer::Merge. You can set write_concurrency to a specific thread count (e.g., 4) to parallelize the Proxima engine's merge operation, or set concurrency to 0 in OptimizeOptions to use ZVec's global optimize thread pool via GlobalResource::Instance().optimize_thread_pool(). Higher concurrency speeds up large merges but consumes more CPU resources.

When should I run Optimize on a ZVec collection?

Run Optimize after scenarios that create index fragmentation: following bulk imports that generate many small segments, after a high volume of document deletions that leave tombstones in segments, or when query latency degrades due to excessive segment count. The CollectionImpl::Optimize function automatically targets segments exceeding COMPACT_DELETE_RATIO_THRESHOLD, so it is safe to run periodically as background maintenance.

Can I use a custom thread pool for ZVec index operations?

Yes. When calling VectorColumnIndexer::Merge directly, populate the pool field in MergeOptions with a pointer to your ailego::ThreadPool instance. This overrides the default behavior and executes the merge within your provided pool. This is useful when integrating ZVec's index maintenance with your application's existing thread management or resource isolation frameworks.
