How to Optimize ZVec Index Performance with Merge and Optimize: A Complete Guide
Use CollectionImpl::Optimize for automatic collection-wide compaction that selects high-delete-ratio segments and rebuilds indexes, or call VectorColumnIndexer::Merge directly for low-level control over combining specific segment indexes with custom concurrency settings.
ZVec, Alibaba's high-performance vector database, stores vectors in segment files where each segment maintains its own Proxima engine index. Over time, accumulated small segments and deleted documents degrade query latency and increase memory pressure. This guide explains how to improve ZVec index performance with the Merge and Optimize operations, based on the implementation in the alibaba/zvec repository.
Understanding ZVec's Segment Architecture
ZVec persists vectors in immutable segment files. Each segment contains its own vector index implemented by the Proxima engine, along with scalar indexes and metadata. As documents are inserted and deleted, a collection accumulates many small segments and tombstone entries, which forces queries to scan multiple indexes and increases memory overhead. The codebase provides two complementary mechanisms to consolidate this data: Merge for engine-level index operations and Optimize for collection-wide compaction orchestration.
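To make the cost of fragmentation concrete, here is a minimal sketch of query fan-out. It is a toy model, not ZVec code: segments are plain dictionaries, vectors are single floats, and the per-segment search is brute force rather than a Proxima index. The names `search_segment` and `search_collection` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical stand-in for a segment: doc_id -> 1-D "vector" (a single float).
Segment = Dict[int, float]


def search_segment(segment: Segment, query: float, k: int) -> List[Tuple[float, int]]:
    """Brute-force top-k within one segment, as (distance, doc_id) pairs."""
    scored = [(abs(value - query), doc_id) for doc_id, value in segment.items()]
    return heapq.nsmallest(k, scored)


def search_collection(segments: List[Segment], query: float, k: int) -> List[Tuple[float, int]]:
    """Probe every segment, then merge the per-segment candidates into a global top-k."""
    candidates: List[Tuple[float, int]] = []
    for segment in segments:  # one index probe per segment
        candidates.extend(search_segment(segment, query, k))
    return heapq.nsmallest(k, candidates)


# The same four documents, spread over four segments or held in one.
fragmented = [{1: 0.1}, {2: 0.9}, {3: 0.45}, {4: 0.5}]
compacted = [{1: 0.1, 2: 0.9, 3: 0.45, 4: 0.5}]
```

Both layouts return the same neighbors, but the fragmented one needs four probes and a larger candidate merge where the compacted one needs a single probe. Consolidation removes that per-segment overhead.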
The Merge Mechanism: Low-Level Index Consolidation
VectorColumnIndexer::Merge is the primitive operation that combines the Proxima engine indexes of several source segments into a single target index. This function lives in src/db/index/column/vector_column/vector_column_indexer.cc at lines 101-133.
VectorColumnIndexer::Merge Implementation
The merge process begins when a target VectorColumnIndexer receives a list of source indexers, an optional IndexFilter, and a MergeOptions struct:
Status VectorColumnIndexer::Merge(
    const std::vector<VectorColumnIndexer::Ptr> &indexers,
    const IndexFilter::Ptr &filter,
    const vector_column_params::MergeOptions &merge_options);
Internally, the function unwraps each VectorColumnIndexer to extract its underlying Proxima engine pointer (engine_indexers). The user-provided IndexFilter is converted to a Proxima-compatible filter (engine_filter) to skip deleted documents during the merge. The actual work is delegated to the Proxima engine:
index->Merge(engine_indexers, *engine_filter,
             {merge_options.write_concurrency, merge_options.pool});
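The effect of passing a filter into the merge can be sketched with a small model. This is not the Proxima implementation: indexes are plain dictionaries, and the `is_filtered` callback (a hypothetical name) is assumed to return True for documents that should be skipped, which may not match the real IndexFilter interface.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical stand-in for a per-segment index: doc_id -> vector.
SegmentIndex = Dict[int, List[float]]


def merge_indexes(
    sources: Iterable[SegmentIndex],
    is_filtered: Optional[Callable[[int], bool]] = None,
) -> SegmentIndex:
    """Copy every document from the sources into one target, skipping filtered IDs."""
    target: SegmentIndex = {}
    for source in sources:
        for doc_id, vector in source.items():
            if is_filtered is not None and is_filtered(doc_id):
                continue  # deleted document: do not carry it into the target
            target[doc_id] = vector
    return target


deleted_ids = {2, 5}
seg_a = {1: [0.1, 0.2], 2: [0.3, 0.4]}
seg_b = {5: [0.5, 0.6], 7: [0.7, 0.8]}
merged = merge_indexes([seg_a, seg_b], is_filtered=lambda doc_id: doc_id in deleted_ids)
```

In this model the merged target holds only documents 1 and 7, so the tombstoned entries stop occupying space in the index.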
Configuring MergeOptions
The MergeOptions struct, defined in src/include/zvec/db/options.h, controls concurrency behavior:
struct MergeOptions {
  uint32_t write_concurrency{1};
  ailego::ThreadPool *pool{nullptr};
};
- write_concurrency: When set greater than 0, the Proxima engine parallelizes the merge across that many writer threads.
- pool: When non-null, the merge executes inside the supplied ailego::ThreadPool, allowing integration with ZVec's global optimize thread pool or custom external pools.
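One way to picture how these two fields interact is the following model, which uses Python's standard `concurrent.futures` in place of ailego::ThreadPool. `MergeOptionsModel` and `run_merge_work` are hypothetical names, and the rule that a supplied pool takes precedence over `write_concurrency` is an assumption made for illustration, not something confirmed by the source.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MergeOptionsModel:
    """Python stand-in for the C++ struct; defaults mirror the source (1, nullptr)."""
    write_concurrency: int = 1
    pool: Optional[Executor] = None


def run_merge_work(work_items: List[Callable[[], int]], options: MergeOptionsModel) -> List[int]:
    """Run work items on the caller's pool if one is given, else on a private pool."""
    if options.pool is not None:
        # Assumed precedence: a supplied pool wins over write_concurrency.
        futures = [options.pool.submit(item) for item in work_items]
        return [future.result() for future in futures]
    workers = max(1, options.write_concurrency)
    with ThreadPoolExecutor(max_workers=workers) as private_pool:
        return list(private_pool.map(lambda item: item(), work_items))
```

Reusing an external pool keeps the total thread count under the application's control, while a private pool sized by `write_concurrency` is simpler when the merge runs in isolation.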
The Optimize Workflow: Collection-Wide Compaction
CollectionImpl::Optimize provides the high-level orchestration for collection-wide index maintenance. This method, located in src/db/collection.cc at lines 686-721, automates segment selection, merge execution, and metadata updates.
CollectionImpl::Optimize Orchestration
The optimize workflow follows a strict sequence to ensure consistency:
- Locking: Acquires collection-wide write locks to prevent concurrent schema changes and writes.
- Segment Selection: Gathers all persisted (read-only) segments. If the active writing segment contains data, it is flushed first.
- Task Building: build_compact_task analyzes delete ratios against COMPACT_DELETE_RATIO_THRESHOLD to determine which segments need rebuilding. It creates CreateVectorIndexTask instances for each vector column.
- Merge Options Setup: Inside SegmentHelper::CreateVectorIndexTask (src/db/index/segment/segment_helper.cc, lines 623-639), the system populates MergeOptions:
vector_column_params::MergeOptions merge_options;
if (concurrency == 0) {
  merge_options.pool = GlobalResource::Instance().optimize_thread_pool();
} else {
  merge_options.write_concurrency = concurrency;
}
- Execution: Tasks run in parallel using the thread pool. After each vector index merge, the new segment is flushed to disk.
- Version Update: The version manager atomically swaps old segment metadata with newly built segments.
- Garbage Collection: Old segments are destroyed and their on-disk files are removed.
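The rebuild, version-swap, and garbage-collection steps above can be sketched with an in-memory model. Everything here is hypothetical: `SegmentModel` and `optimize` are illustrative names, segments are dictionaries, and a rebuild is assumed to mean rewriting a flagged segment without its tombstoned documents. Locking, flushing, and disk I/O are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SegmentModel:
    """Hypothetical in-memory segment: an ID, its documents, and its tombstones."""
    segment_id: int
    docs: Dict[int, str]
    deleted: Set[int] = field(default_factory=set)


def optimize(
    version: List[SegmentModel], flagged: Set[int], next_id: int
) -> Tuple[List[SegmentModel], List[int]]:
    """Rebuild flagged segments without tombstoned docs.

    Returns the new version and the IDs of segments to garbage-collect.
    """
    new_version: List[SegmentModel] = []
    retired: List[int] = []
    for segment in version:
        if segment.segment_id not in flagged:
            new_version.append(segment)  # untouched segments carry over as-is
            continue
        live = {d: v for d, v in segment.docs.items() if d not in segment.deleted}
        new_version.append(SegmentModel(next_id, live))
        retired.append(segment.segment_id)
        next_id += 1
    # The caller replaces its version reference in one assignment (the "swap")
    # and only afterwards removes the retired segments' files.
    return new_version, retired
```

Because the old version list is never mutated, a reader holding it keeps a consistent view until the swap. That is the property the atomic version update is meant to provide.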
Segment Selection and Delete Ratios
The optimizer specifically targets segments with high delete ratios. When the proportion of deleted documents in a segment exceeds COMPACT_DELETE_RATIO_THRESHOLD, the segment is flagged for rebuilding. This threshold ensures that optimize operations focus on segments where compaction will yield the greatest performance improvement.
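The selection rule can be illustrated as follows. The threshold value used here is an assumption chosen for the example, since the actual value of COMPACT_DELETE_RATIO_THRESHOLD is not stated in this guide. `SegmentStats` and `select_for_rebuild` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

# Assumed value for illustration only; the real constant lives in the zvec source.
ASSUMED_DELETE_RATIO_THRESHOLD = 0.3


@dataclass
class SegmentStats:
    segment_id: int
    total_docs: int
    deleted_docs: int

    @property
    def delete_ratio(self) -> float:
        """Fraction of documents in the segment that are tombstoned."""
        return self.deleted_docs / self.total_docs if self.total_docs else 0.0


def select_for_rebuild(
    stats: List[SegmentStats], threshold: float = ASSUMED_DELETE_RATIO_THRESHOLD
) -> List[int]:
    """Return IDs of segments whose delete ratio strictly exceeds the threshold."""
    return [s.segment_id for s in stats if s.delete_ratio > threshold]


segments = [
    SegmentStats(segment_id=1, total_docs=1000, deleted_docs=50),   # 5% deleted
    SegmentStats(segment_id=2, total_docs=1000, deleted_docs=400),  # 40% deleted
    SegmentStats(segment_id=3, total_docs=0, deleted_docs=0),       # empty segment
]
```

With the assumed threshold of 0.3, only segment 2 is selected. Segment 1 would cost a full rebuild to reclaim just 5% of its space.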
When to Use Merge vs Optimize
Choose the appropriate mechanism based on your operational requirements:
- Frequent small writes: When your collection accumulates many tiny segments and delete ratios grow, call collection.Optimize() to compact everything automatically.
- Control over specific fields: When you need to merge only specific vector indexes (e.g., after bulk importing into a single field), use VectorColumnIndexer::Merge directly with custom MergeOptions.
- Custom thread pools: When you have an external thread pool you want to reuse, populate MergeOptions.pool with your ailego::ThreadPool* and pass it to Merge.
Code Examples
C++ Simple Optimize
#include <zvec/db/collection.h>
#include <zvec/db/options.h>
int main() {
  // Open an existing collection
  auto coll = zvec::Collection::Open("my_collection");
  // Optimize using the global optimize thread-pool (concurrency=0)
  zvec::OptimizeOptions opt{0};  // 0 => use GlobalResource::optimize_thread_pool()
  auto status = coll->Optimize(opt);
  if (!status.ok()) {
    LOG_ERROR("Optimize failed: %s", status.message().c_str());
  }
  return 0;
}
C++ Manual Merge
#include <memory>
#include <zvec/db/index/column/vector_column/vector_column_indexer.h>
#include <zvec/db/index/common/index_filter.h>
using namespace zvec;
int main() {
  // Assume we already have two ready-to-merge indexers.
  // field_schema is the vector field's schema; it is assumed to be defined elsewhere.
  auto target = std::make_shared<VectorColumnIndexer>("path/to/target.idx", field_schema);
  auto src = std::make_shared<VectorColumnIndexer>("path/to/src.idx", field_schema);
  target->Open({true, true});  // mmap, create_new=true
  src->Open({true, true});
  // Optional: filter out deleted doc IDs
  auto filter = std::make_shared<IndexFilter>();
  // ... fill filter ...
  // Merge with 4 writer threads
  vector_column_params::MergeOptions opts;
  opts.write_concurrency = 4;  // or opts.pool = your_thread_pool;
  auto s = target->Merge({src}, filter, opts);
  if (!s.ok()) {
    LOG_ERROR("Merge failed: %s", s.message().c_str());
    return 1;
  }
  return 0;
}
Python Optimize Collection
from zvec import Collection, OptimizeOption
# Open an existing collection
coll = Collection.open("my_collection")
# Optimize using the default thread-pool (concurrency=0)
coll.optimize(OptimizeOption()) # equivalent to OptimizeOption(concurrency=0)
# Or limit to 2 threads
coll.optimize(OptimizeOption(concurrency=2))
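Key Source Files
- src/db/index/column/vector_column/vector_column_indexer.cc: VectorColumnIndexer::Merge (lines 101-133).
- src/include/zvec/db/options.h: the MergeOptions struct.
- src/db/collection.cc: CollectionImpl::Optimize (lines 686-721).
- src/db/index/segment/segment_helper.cc: SegmentHelper::CreateVectorIndexTask (lines 623-639).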
Summary
- ZVec stores vectors in segment files with individual Proxima engine indexes, leading to fragmentation over time.
- Merge (VectorColumnIndexer::Merge) is the engine-level primitive that combines indexes from multiple segments, supporting concurrent writes via MergeOptions.
- Optimize (CollectionImpl::Optimize) orchestrates collection-wide compaction, automatically selecting segments with high delete ratios and invoking Merge for each vector column.
- Configure concurrency through write_concurrency (thread count) or pool (custom thread pool) in MergeOptions.
- Use Optimize for routine maintenance after bulk inserts or high deletion rates; use Merge directly when you need fine-grained control over specific vector columns.
Frequently Asked Questions
What is the difference between Merge and Optimize in ZVec?
Merge is a low-level operation implemented in VectorColumnIndexer::Merge that combines the Proxima engine indexes of specific source segments into a single target index. Optimize is a high-level collection management operation implemented in CollectionImpl::Optimize that orchestrates the entire compaction process: it selects segments with high delete ratios, creates new segments, invokes Merge for each vector column, updates metadata, and deletes old files.
How does concurrency affect ZVec index merging?
Concurrency is controlled through the MergeOptions struct passed to VectorColumnIndexer::Merge. You can set write_concurrency to a specific thread count (e.g., 4) to parallelize the Proxima engine's merge operation, or set concurrency to 0 in OptimizeOptions to use ZVec's global optimize thread pool via GlobalResource::Instance().optimize_thread_pool(). Higher concurrency speeds up large merges but consumes more CPU resources.
When should I run Optimize on a ZVec collection?
Run Optimize after scenarios that create index fragmentation: following bulk imports that generate many small segments, after a high volume of document deletions that leave tombstones in segments, or when query latency degrades due to excessive segment count. The CollectionImpl::Optimize function automatically targets segments exceeding COMPACT_DELETE_RATIO_THRESHOLD, so it is safe to run periodically as background maintenance.
Can I use a custom thread pool for ZVec index operations?
Yes. When calling VectorColumnIndexer::Merge directly, populate the pool field in MergeOptions with a pointer to your ailego::ThreadPool instance. This overrides the default behavior and executes the merge within your provided pool. This is useful when integrating ZVec's index maintenance with your application's existing thread management or resource isolation frameworks.