# How to Optimize ZVec Index Performance with Merge and Optimize: A Complete Guide

> Optimize ZVec index performance using CollectionImpl Optimize for automatic compaction or VectorColumnIndexer Merge for direct segment index control with custom concurrency.

- Repository: [Alibaba/zvec](https://github.com/alibaba/zvec)
- Tags: how-to-guide
- Published: 2026-02-16

---

**Use `CollectionImpl::Optimize` for automatic collection-wide compaction that selects high-delete-ratio segments and rebuilds indexes, or call `VectorColumnIndexer::Merge` directly for low-level control over combining specific segment indexes with custom concurrency settings.**

ZVec, Alibaba's high-performance vector database, stores vectors in segment files, and each segment maintains its own Proxima engine index. Over time, accumulated small segments and deleted documents degrade query latency and increase memory pressure. This guide explains how the merge and optimize operations address both problems, following the implementation in the `alibaba/zvec` repository.

## Understanding ZVec's Segment Architecture

ZVec persists vectors in immutable **segment files**. Each segment contains its own vector index implemented by the Proxima engine, along with scalar indexes and metadata. As documents are inserted and deleted, a collection accumulates many small segments and tombstone entries, which forces queries to scan multiple indexes and increases memory overhead. The codebase provides two complementary mechanisms to consolidate this data: **Merge** for engine-level index operations and **Optimize** for collection-wide compaction orchestration.
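To make the cost of fragmentation concrete, the sketch below models segments as plain Python objects. It is a toy illustration only: `Segment`, `query_cost`, and `consolidate` are hypothetical names, not part of ZVec's API, and real segments hold Proxima indexes rather than lists of IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for one immutable segment: doc IDs plus tombstones."""
    doc_ids: list
    deleted: set = field(default_factory=set)

    def live_docs(self):
        return [d for d in self.doc_ids if d not in self.deleted]


def query_cost(segments):
    """Return (indexes probed per query, entries still stored)."""
    probes = len(segments)
    entries = sum(len(s.doc_ids) for s in segments)
    return probes, entries


def consolidate(segments):
    """Fold all live documents into a single fresh segment."""
    live = [d for s in segments for d in s.live_docs()]
    return [Segment(doc_ids=live)]


fragmented = [
    Segment([1, 2, 3], deleted={2}),
    Segment([4, 5], deleted={4, 5}),
    Segment([6]),
]
compacted = consolidate(fragmented)

print(query_cost(fragmented))  # (3, 6)
print(query_cost(compacted))   # (1, 3)
```

In this model a query against the fragmented layout probes three indexes and carries three tombstoned entries; after consolidation it probes one index that holds only live documents.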

## The Merge Mechanism: Low-Level Index Consolidation

`VectorColumnIndexer::Merge` is the primitive operation that combines the Proxima engine indexes of several source segments into a single target index. This function lives in `src/db/index/column/vector_column/vector_column_indexer.cc` at lines 101-133.

### VectorColumnIndexer::Merge Implementation

The merge process begins when a target `VectorColumnIndexer` receives a list of source indexers, an optional `IndexFilter`, and a `MergeOptions` struct:

```cpp
Status VectorColumnIndexer::Merge(
    const std::vector<VectorColumnIndexer::Ptr> &indexers,
    const IndexFilter::Ptr &filter,
    const vector_column_params::MergeOptions &merge_options);
```

Internally, the function unwraps each `VectorColumnIndexer` to extract its underlying Proxima engine pointer (`engine_indexers`). The user-provided `IndexFilter` is converted to a Proxima-compatible filter (`engine_filter`) to skip deleted documents during the merge. The actual work is delegated to the Proxima engine:

```cpp
index->Merge(engine_indexers, *engine_filter,
             {merge_options.write_concurrency, merge_options.pool});
```
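The three steps described above (unwrap the sources, convert the filter, delegate to the engine) can be sketched with a toy model. `ToyEngine` and `ToyIndexer` are hypothetical stand-ins written for this illustration; they are not ZVec or Proxima classes, and the "index" here is just a dictionary from document ID to vector.

```python
class ToyEngine:
    """Stand-in for an engine-level index: maps doc ID -> vector."""

    def __init__(self, vectors=None):
        self.vectors = dict(vectors or {})

    def merge(self, engines, is_filtered):
        # Copy every entry that the filter does not reject.
        for engine in engines:
            for doc_id, vec in engine.vectors.items():
                if not is_filtered(doc_id):
                    self.vectors[doc_id] = vec


class ToyIndexer:
    """Stand-in for the wrapper that owns an engine index."""

    def __init__(self, vectors=None):
        self.engine = ToyEngine(vectors)

    def merge(self, indexers, deleted_ids):
        # Step 1: unwrap each source indexer to its underlying engine.
        engines = [ix.engine for ix in indexers]
        # Step 2: convert the caller's filter into the form the engine expects.
        deleted = frozenset(deleted_ids)
        is_filtered = deleted.__contains__
        # Step 3: delegate the actual work to the engine.
        self.engine.merge(engines, is_filtered)


src_a = ToyIndexer({1: [0.1, 0.2], 2: [0.3, 0.4]})
src_b = ToyIndexer({3: [0.5, 0.6]})
target = ToyIndexer()
target.merge([src_a, src_b], deleted_ids={2})

print(sorted(target.engine.vectors))  # [1, 3]
```

Document 2 is dropped during the merge, so the target never stores entries for deleted documents.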

### Configuring MergeOptions

The `MergeOptions` struct, defined in [`src/include/zvec/db/options.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/options.h), controls concurrency behavior:

```cpp
struct MergeOptions {
    uint32_t write_concurrency{1};
    ailego::ThreadPool *pool{nullptr};
};
```

- **`write_concurrency`**: The number of writer threads the Proxima engine uses for the merge. The default is 1; larger values parallelize the merge across that many threads.
- **`pool`**: When non-null, the merge executes inside the supplied `ailego::ThreadPool`, allowing integration with ZVec's global optimize thread pool or custom external pools.
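The choice between a thread count and a caller-supplied pool can be illustrated with Python's standard `concurrent.futures`. This is a sketch of the general pattern, not ZVec's implementation: `ToyMergeOptions` and `run_merge` are hypothetical, and the assumption that a non-null pool takes precedence over `write_concurrency` is mine, based on the description of `pool` above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMergeOptions:
    write_concurrency: int = 1
    pool: Optional[ThreadPoolExecutor] = None


def run_merge(chunks, options):
    """Process chunks on the supplied pool, or on a private pool
    sized by write_concurrency when no pool is given."""
    def work(chunk):
        return sum(chunk)  # placeholder for per-chunk merge work

    if options.pool is not None:
        # Assumed precedence: a supplied pool wins over the thread count.
        return list(options.pool.map(work, chunks))
    workers = max(1, options.write_concurrency)
    with ThreadPoolExecutor(max_workers=workers) as private:
        return list(private.map(work, chunks))


chunks = [[1, 2], [3, 4], [5]]

# Dedicated threads for this merge only.
by_count = run_merge(chunks, ToyMergeOptions(write_concurrency=4))

# Reuse a pool shared with other work.
with ThreadPoolExecutor(max_workers=2) as shared:
    by_pool = run_merge(chunks, ToyMergeOptions(pool=shared))

print(by_count)  # [3, 7, 5]
print(by_pool)   # [3, 7, 5]
```

Both paths produce the same result; the difference is who owns the threads. A shared pool caps total parallelism across all merges, while a per-merge thread count gives each merge its own budget.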

## The Optimize Workflow: Collection-Wide Compaction

`CollectionImpl::Optimize` provides the high-level orchestration for collection-wide index maintenance. This method, located in `src/db/collection.cc` at lines 686-721, automates segment selection, merge execution, and metadata updates.

### CollectionImpl::Optimize Orchestration

The optimize workflow follows a strict sequence to ensure consistency:

1. **Locking**: Acquires collection-wide write locks to prevent concurrent schema changes and writes.
2. **Segment Selection**: Gathers all persisted (read-only) segments. If the active writing segment contains data, it is flushed first.
3. **Task Building**: `build_compact_task` analyzes delete ratios against `COMPACT_DELETE_RATIO_THRESHOLD` to determine which segments need rebuilding. It creates `CreateVectorIndexTask` instances for each vector column.
4. **Merge Options Setup**: Inside `SegmentHelper::CreateVectorIndexTask` (`src/db/index/segment/segment_helper.cc`, lines 623-639), the system populates `MergeOptions`:

   ```cpp
   vector_column_params::MergeOptions merge_options;
   if (concurrency == 0) {
       merge_options.pool = GlobalResource::Instance().optimize_thread_pool();
   } else {
       merge_options.write_concurrency = concurrency;
   }
   ```

5. **Execution**: Tasks run in parallel using the thread pool. After each vector index merge, the new segment is flushed to disk.
6. **Version Update**: The version manager atomically swaps old segment metadata with newly built segments.
7. **Garbage Collection**: Old segments are destroyed and their on-disk files are removed.
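The seven steps can be traced end to end in a small simulation. Everything in it is hypothetical: `ToyCollection`, `rebuild`, and the threshold value are assumptions made for illustration, segments are reduced to `(doc_ids, deleted_ids)` pairs, and each selected segment is rebuilt on its own, which may differ from how ZVec groups segments into tasks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

ASSUMED_THRESHOLD = 0.3  # hypothetical value, not the real constant


def rebuild(segment):
    """Return a fresh segment containing only the live documents."""
    doc_ids, deleted = segment
    return ([d for d in doc_ids if d not in deleted], set())


class ToyCollection:
    def __init__(self, persisted, writing):
        self.lock = threading.Lock()
        self.persisted = persisted  # list of (doc_ids, deleted_ids) pairs
        self.writing = writing      # active segment, same shape
        self.garbage = []           # old segments whose files would be removed

    def optimize(self):
        with self.lock:  # 1. block concurrent writes
            if self.writing[0]:  # 2. flush the active segment if it has data
                self.persisted.append(self.writing)
                self.writing = ([], set())
            # 3. pick segments whose delete ratio exceeds the threshold
            targets = [s for s in self.persisted
                       if s[0] and len(s[1]) / len(s[0]) > ASSUMED_THRESHOLD]
            # 4-5. rebuild the selected segments in parallel
            with ThreadPoolExecutor(max_workers=2) as pool:
                rebuilt = list(pool.map(rebuild, targets))
            # 6. swap old segments for new ones in a single assignment
            target_ids = {id(s) for s in targets}
            kept = [s for s in self.persisted if id(s) not in target_ids]
            self.persisted = kept + rebuilt
            # 7. hand the replaced segments to garbage collection
            self.garbage.extend(targets)


coll = ToyCollection(
    persisted=[([1, 2, 3, 4], {1, 2}), ([5, 6, 7, 8], set())],
    writing=([9, 10], {9}),
)
coll.optimize()

print(coll.persisted)     # [([5, 6, 7, 8], set()), ([3, 4], set()), ([10], set())]
print(len(coll.garbage))  # 2
```

The untouched segment survives as is, the two segments with a 50% delete ratio are replaced by tombstone-free copies, and the originals are queued for removal only after the swap.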

### Segment Selection and Delete Ratios

The optimizer specifically targets segments with high delete ratios. When the proportion of deleted documents in a segment exceeds `COMPACT_DELETE_RATIO_THRESHOLD`, the segment is flagged for rebuilding. This threshold ensures that optimize operations focus on segments where compaction will yield the greatest performance improvement.
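As a worked example of the selection rule, the snippet below computes delete ratios for three made-up segments. The function names and the threshold of 0.3 are assumptions for illustration; the actual value of `COMPACT_DELETE_RATIO_THRESHOLD` is not stated here.

```python
def delete_ratio(total_docs, deleted_docs):
    """Fraction of a segment's documents that are tombstoned."""
    return deleted_docs / total_docs if total_docs else 0.0


def select_for_compaction(segments, threshold):
    """Return names of segments whose delete ratio exceeds the threshold."""
    return [name for name, total, deleted in segments
            if delete_ratio(total, deleted) > threshold]


# (name, total documents, deleted documents); all numbers are made up.
segments = [
    ("seg-0", 1000, 50),   # ratio 0.05
    ("seg-1", 1000, 300),  # ratio 0.30, equal to the threshold
    ("seg-2", 200, 150),   # ratio 0.75
]
ASSUMED_THRESHOLD = 0.3  # hypothetical; not the real constant's value

selected = select_for_compaction(segments, ASSUMED_THRESHOLD)
print(selected)  # ['seg-2']
```

Because the rule is "exceeds", a segment sitting exactly at the threshold (`seg-1`) is left alone in this sketch; only `seg-2` qualifies.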

## When to Use Merge vs Optimize

Choose the appropriate mechanism based on your operational requirements:

- **Frequent small writes**: When your collection accumulates many tiny segments and delete ratios grow, call `collection.Optimize()` to compact everything automatically.
- **Control over specific fields**: When you need to merge only specific vector indexes (e.g., after bulk importing into a single field), use `VectorColumnIndexer::Merge` directly with custom `MergeOptions`.
- **Custom thread pools**: When you have an external thread pool you want to reuse, populate `MergeOptions.pool` with your `ailego::ThreadPool*` and pass it to `Merge`.

## Code Examples

### C++ Simple Optimize

```cpp
#include <iostream>

#include <zvec/db/collection.h>
#include <zvec/db/options.h>

int main() {
    // Open an existing collection
    auto coll = zvec::Collection::Open("my_collection");

    // Optimize using the global optimize thread pool (concurrency=0)
    zvec::OptimizeOptions opt{0};  // 0 => use GlobalResource::optimize_thread_pool()
    auto status = coll->Optimize(opt);

    if (!status.ok()) {
        // Report the failure and exit with a non-zero code
        std::cerr << "Optimize failed: " << status.message() << '\n';
        return 1;
    }
    return 0;
}
```

### C++ Manual Merge

```cpp
#include <iostream>
#include <memory>

#include <zvec/db/index/column/vector_column/vector_column_indexer.h>
#include <zvec/db/index/common/index_filter.h>

using namespace zvec;

int main() {
    // Assume `field_schema` (the vector field's schema) is already available
    // and that the source index exists on disk and is ready to merge.
    auto target = std::make_shared<VectorColumnIndexer>("path/to/target.idx", field_schema);
    auto src = std::make_shared<VectorColumnIndexer>("path/to/src.idx", field_schema);

    target->Open({true, true});   // mmap, create_new=true
    src->Open({true, false});     // mmap, open the existing source index

    // Optional: filter out deleted doc IDs
    auto filter = std::make_shared<IndexFilter>();
    // ... fill filter ...

    // Merge with 4 writer threads
    vector_column_params::MergeOptions opts;
    opts.write_concurrency = 4;  // or opts.pool = your_thread_pool;
    auto s = target->Merge({src}, filter, opts);

    if (!s.ok()) {
        std::cerr << "Merge failed: " << s.message() << '\n';
        return 1;
    }
    return 0;
}
```

### Python Optimize Collection

```python
from zvec import Collection, OptimizeOption

# Open an existing collection
coll = Collection.open("my_collection")

# Optimize using the default thread pool (concurrency=0)
coll.optimize(OptimizeOption())  # equivalent to OptimizeOption(concurrency=0)

# Or limit to 2 threads
coll.optimize(OptimizeOption(concurrency=2))
```

## Key Source Files

| File | Role | Direct Link |
|------|------|-------------|
| `src/db/index/column/vector_column/vector_column_indexer.cc` | Implements `VectorColumnIndexer::Merge`, the low-level merge primitive. | <https://github.com/alibaba/zvec/blob/main/src/db/index/column/vector_column/vector_column_indexer.cc> |
| `src/db/index/segment/segment_helper.cc` | Constructs `MergeOptions` for each column during optimization. | <https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment_helper.cc> |
| `src/db/collection.cc` | High-level `CollectionImpl::Optimize` driving segment compaction. | <https://github.com/alibaba/zvec/blob/main/src/db/collection.cc> |
| `src/include/zvec/db/options.h` | Definition of `OptimizeOptions` and `MergeOptions`. | <https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/options.h> |
| `python/zvec/model/collection.py` | Python wrapper exposing `Collection.optimize`. | <https://github.com/alibaba/zvec/blob/main/python/zvec/model/collection.py> |
| `python/zvec/model/param/__init__.py` | Python `OptimizeOption` dataclass definition. | <https://github.com/alibaba/zvec/blob/main/python/zvec/model/param/__init__.py> |

## Summary

- **ZVec** stores vectors in segment files, each with its own Proxima engine index, so a collection fragments over time as small segments and deletions accumulate.
- **Merge** (`VectorColumnIndexer::Merge`) is the engine-level primitive that combines indexes from multiple segments, supporting concurrent writes via `MergeOptions`.
- **Optimize** (`CollectionImpl::Optimize`) orchestrates collection-wide compaction, automatically selecting segments with high delete ratios and invoking Merge for each vector column.
- Configure concurrency through `write_concurrency` (thread count) or `pool` (custom thread pool) in `MergeOptions`.
- Use **Optimize** for routine maintenance after bulk inserts or high deletion rates; use **Merge** directly when you need fine-grained control over specific vector columns.

## Frequently Asked Questions

### What is the difference between Merge and Optimize in ZVec?

**Merge** is a low-level operation implemented in `VectorColumnIndexer::Merge` that combines the Proxima engine indexes of specific source segments into a single target index. **Optimize** is a high-level collection management operation implemented in `CollectionImpl::Optimize` that orchestrates the entire compaction process: it selects segments with high delete ratios, creates new segments, invokes Merge for each vector column, updates metadata, and deletes old files.

### How does concurrency affect ZVec index merging?

Concurrency is configured at two levels. When calling `VectorColumnIndexer::Merge` directly, set `write_concurrency` in `MergeOptions` to a specific thread count (e.g., 4) to parallelize the Proxima engine's merge, or set `pool` to run the merge on a thread pool you supply. When calling Optimize, set `concurrency` in `OptimizeOptions`: a value of 0 uses ZVec's global optimize thread pool via `GlobalResource::Instance().optimize_thread_pool()`, and any other value is passed through as `write_concurrency`. Higher concurrency speeds up large merges but consumes more CPU resources.

### When should I run Optimize on a ZVec collection?

Run `Optimize` after scenarios that create index fragmentation: following bulk imports that generate many small segments, after a high volume of document deletions that leave tombstones in segments, or when query latency degrades due to excessive segment count. The `CollectionImpl::Optimize` function automatically targets segments exceeding `COMPACT_DELETE_RATIO_THRESHOLD`, so it can be run periodically as background maintenance. Because it acquires collection-wide write locks, writes may be blocked while it runs, so prefer periods of low write traffic.

### Can I use a custom thread pool for ZVec index operations?

Yes. When calling `VectorColumnIndexer::Merge` directly, populate the `pool` field in `MergeOptions` with a pointer to your `ailego::ThreadPool` instance. This overrides the default behavior and executes the merge within your provided pool. This is useful when integrating ZVec's index maintenance with your application's existing thread management or resource isolation frameworks.