Internal Structure of ZVec Segments: A Deep Dive into Alibaba's Vector Database Storage

A ZVec segment is a self-contained storage unit that bundles raw forward data, scalar indexes, vector indexes, and metadata into a block-based architecture managed by SegmentImpl and SegmentMeta classes.

The internal structure of ZVec segments forms the foundation of Alibaba's zvec vector database, determining how collections store, index, and retrieve high-dimensional data. Each segment operates as an independent shard within a collection, managing its own write-ahead log, version control, and block-based persistence layer. Understanding this architecture is essential for optimizing storage layout, tuning flush thresholds, and debugging performance bottlenecks in production deployments.

Core Components of a ZVec Segment

Segment Interface and Implementation

The segment architecture follows a clear separation between interface and implementation. The Segment interface in src/db/index/segment/segment.h defines the high-level API for creating, opening, inserting, fetching, and scanning documents. This abstract class provides static factory methods CreateAndOpen and Open that return Segment::Ptr instances.

The concrete implementation resides in SegmentImpl within src/db/index/segment/segment.cc. This class manages the runtime state required for fast writes and reads, including memory-mapped forward stores, inverted indexers, and vector column indexers. All write operations acquire a std::lock_guard on seg_mtx_ to ensure thread safety during concurrent insertions and updates.

Metadata Hierarchy: SegmentMeta and BlockMeta

Every segment maintains a metadata hierarchy defined in src/db/index/common/meta.h. SegmentMeta serves as the segment-level header, storing the segment ID, a list of persisted blocks, the currently active writing forward block, and the set of indexed vector fields.

BlockMeta describes individual persisted blocks within a segment. Each block metadata entry tracks the block type, document ID range, and column list. These metadata structures enable the segment to locate specific data blocks during read operations without scanning the entire storage file.
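
To make the hierarchy concrete, here is a minimal sketch of how such metadata might be laid out and queried. The struct names, field names, and the FindBlock helper are hypothetical illustrations; they are not copied from meta.h, and the real classes may differ in both shape and naming.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the types described above.
enum class BlockKind : uint32_t {
  Scalar = 1, ScalarIndex = 2, VectorIndex = 3, VectorIndexQuantize = 4
};

struct BlockMetaSketch {
  uint32_t id;
  BlockKind kind;
  uint64_t min_doc_id;               // inclusive
  uint64_t max_doc_id;               // inclusive
  std::vector<std::string> columns;  // columns stored in this block
};

struct SegmentMetaSketch {
  uint32_t segment_id;
  std::vector<BlockMetaSketch> persisted_blocks;
  std::optional<BlockMetaSketch> writing_forward_block;
  std::vector<std::string> indexed_vector_fields;
};

// Returns the persisted block of the given kind whose doc ID range
// covers doc_id, or nullptr when no such block exists.
inline const BlockMetaSketch* FindBlock(const SegmentMetaSketch& meta,
                                        BlockKind kind, uint64_t doc_id) {
  for (const auto& block : meta.persisted_blocks) {
    assert(block.min_doc_id <= block.max_doc_id);  // sanity check on the range
    if (block.kind == kind && block.min_doc_id <= doc_id &&
        doc_id <= block.max_doc_id) {
      return &block;
    }
  }
  return nullptr;
}
```

The point of the sketch is that a read only needs the small metadata list to decide which block to open; the block contents themselves are never touched during the lookup.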

Block Types and Storage Classification

The BlockType enum in src/include/zvec/db/type.h categorizes blocks into four storage types, plus an UNDEFINED sentinel value:

// src/include/zvec/db/type.h
enum BlockType : uint32_t {
    UNDEFINED = 0,
    SCALAR = 1,               // forward columns (global id, pk, user fields)
    SCALAR_INDEX = 2,         // inverted indexes for scalar fields
    VECTOR_INDEX = 3,         // vector column indexes (e.g. HNSW, Flat)
    VECTOR_INDEX_QUANTIZE = 4 // optional quantized vector indexes
};

SCALAR blocks store forward column values including global document IDs, primary keys, and user-defined fields. These are the only blocks that accept writes while a segment remains open. SCALAR_INDEX blocks contain persisted inverted indexes for scalar fields, while VECTOR_INDEX and VECTOR_INDEX_QUANTIZE blocks store high-dimensional vector indexes and their quantized variants respectively.
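
The write rule above can be expressed as a small predicate. The enum below repeats the values quoted from type.h; the two helper functions (AcceptsWritesWhileOpen and BlockTypeName) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Same values as the enum quoted from src/include/zvec/db/type.h.
enum BlockType : uint32_t {
  UNDEFINED = 0,
  SCALAR = 1,
  SCALAR_INDEX = 2,
  VECTOR_INDEX = 3,
  VECTOR_INDEX_QUANTIZE = 4
};

// Hypothetical helper: only forward (SCALAR) blocks take writes
// while the segment is open.
constexpr bool AcceptsWritesWhileOpen(BlockType type) {
  return type == SCALAR;
}

// Hypothetical helper: human-readable name, useful in logs.
constexpr std::string_view BlockTypeName(BlockType type) {
  switch (type) {
    case SCALAR: return "SCALAR";
    case SCALAR_INDEX: return "SCALAR_INDEX";
    case VECTOR_INDEX: return "VECTOR_INDEX";
    case VECTOR_INDEX_QUANTIZE: return "VECTOR_INDEX_QUANTIZE";
    default: break;
  }
  return "UNDEFINED";
}
```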

Block-Based Storage Architecture

ZVec segments organize all persistent data into immutable blocks. When a segment is created, Segment::CreateAndOpen initializes an empty SCALAR forward block and writes the initial SegmentMeta to disk. As documents are inserted, the active writing block accumulates data in memory through MemForwardStore.

When the memory block reaches a configurable threshold, the flush() method writes the block to persistent storage, creates a BlockMeta entry with type SCALAR, and updates SegmentMeta::persisted_blocks_. The segment then allocates a new writing forward block with a fresh BlockID generated by block_id_allocator_.
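
The flush cycle can be sketched with a toy segment that buffers rows in memory and seals them into immutable blocks. Everything here (ToySegment, ToyBlock, a row-count threshold) is an assumption made for illustration; the real flush() writes to disk and updates SegmentMeta, and its threshold may be measured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct ToyBlock {
  uint32_t block_id;
  std::vector<int64_t> rows;  // treated as immutable once sealed
};

class ToySegment {
 public:
  explicit ToySegment(std::size_t flush_threshold)
      : flush_threshold_(flush_threshold),
        writing_block_id_(next_block_id_++) {
    assert(flush_threshold_ > 0);
  }

  void Insert(int64_t value) {
    memory_rows_.push_back(value);
    if (memory_rows_.size() >= flush_threshold_) Flush();
  }

  // Seals the in-memory rows into a persisted block and starts a new
  // writing block with a freshly allocated block ID.
  void Flush() {
    if (memory_rows_.empty()) return;
    persisted_.push_back(ToyBlock{writing_block_id_, std::move(memory_rows_)});
    memory_rows_.clear();  // reset the moved-from buffer to a known state
    writing_block_id_ = next_block_id_++;
  }

  const std::vector<ToyBlock>& persisted() const { return persisted_; }
  std::size_t buffered() const { return memory_rows_.size(); }
  uint32_t writing_block_id() const { return writing_block_id_; }

 private:
  std::size_t flush_threshold_;
  uint32_t next_block_id_ = 0;  // declared before writing_block_id_ on purpose
  uint32_t writing_block_id_;
  std::vector<int64_t> memory_rows_;
  std::vector<ToyBlock> persisted_;
};
```

With a threshold of 3, inserting seven rows leaves two sealed blocks (IDs 0 and 1) and one buffered row in writing block 2.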

Scalar and vector index blocks follow a similar lifecycle, but the persisted blocks are typically built during background compaction rather than on the hot write path. SegmentImpl keeps separate collections for persisted forward stores (persist_stores_) and persisted vector indexes (vector_indexers_ and quant_vector_indexers_), so an ANN search can cover every persisted block.

In-Memory Runtime State

SegmentImpl Member Structure

The SegmentImpl class holds the runtime state that connects persisted blocks to active read and write operations. Key members include:

  • MemForwardStore::Ptr memory_store_ – In-memory buffer for the active writing block, receiving all forward column data during inserts.
  • std::vector<BaseForwardStore::Ptr> persist_stores_ – Read-only persisted forward stores loaded from SCALAR blocks on disk.
  • InvertedIndexer::Ptr invert_indexers_ – Scalar inverted indexes supporting range and equality queries on non-vector fields.
  • std::unordered_map<std::string, VectorColumnIndexer::Ptr> memory_vector_indexers_ – Active vector indexers for the writing block, supporting HNSW or Flat index types.
  • std::unordered_map<std::string, std::vector<VectorColumnIndexer::Ptr>> vector_indexers_ – Persisted vector indexers organized by field name across all blocks.
  • std::unordered_map<std::string, std::vector<VectorColumnIndexer::Ptr>> quant_vector_indexers_ – Optional quantized vector indexes for memory-efficient ANN search.
  • IDMap::Ptr id_map_ and DeleteStore::Ptr delete_store_ – Primary key to global document ID mapping and soft-delete bookkeeping.
  • VersionManager::Ptr version_manager_ – Handles version-specific features such as memory-mapped I/O enablement.
  • WalFilePtr wal_file_ – Write-ahead log ensuring durability for unflushed operations.
  • std::atomic<uint64_t> doc_id_allocator_ and std::atomic<BlockID> block_id_allocator_ – Monotonic ID generators for documents and blocks.
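
The last entry, the pair of atomic allocators, follows a common pattern that can be sketched in a few lines. The class name and the relaxed memory ordering are assumptions for illustration; the source only states that the members are std::atomic counters.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical wrapper around an atomic counter. fetch_add returns the
// previous value, so every caller receives a distinct, increasing ID
// even when several threads allocate concurrently.
class MonotonicIdAllocator {
 public:
  explicit MonotonicIdAllocator(uint64_t first = 0) : next_(first) {}

  uint64_t Allocate() {
    return next_.fetch_add(1, std::memory_order_relaxed);
  }

  // Next ID that would be handed out; useful when persisting state.
  uint64_t Peek() const { return next_.load(std::memory_order_relaxed); }

 private:
  std::atomic<uint64_t> next_;
};
```

Relaxed ordering is enough for uniqueness on its own; if other data must be visible to a thread that observes a given ID, a stronger ordering or an enclosing lock is needed.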

Segment Lifecycle Operations

Creation and Opening

Segments are instantiated through two primary factory methods defined in src/db/index/segment/segment.h. Segment::CreateAndOpen initializes a new segment directory, allocates the first BlockID, creates an empty MemForwardStore, and writes the initial SegmentMeta footer to disk. This method is invoked when a collection needs to expand storage capacity beyond existing segments.

Segment::Open reconstructs a segment from existing disk state. It reads the SegmentMeta footer using IndexFormat structures from src/include/zvec/core/framework/index_format.h, reloads all persisted forward stores into persist_stores_, and reconstructs scalar and vector indexers from their respective blocks. The method ensures crash recovery by replaying any unflushed WAL entries via wal_file_.
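
WAL replay on open can be sketched as follows. The record layout, the sequence-number scheme, and the ReplayWal function are all assumptions for illustration; the source states only that unflushed entries are replayed through wal_file_.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class WalOp { Insert, Delete };

struct WalRecord {
  uint64_t seq;    // hypothetical monotonically increasing sequence number
  WalOp op;
  std::string pk;
  int64_t value;   // ignored for Delete
};

// Rebuilds the unflushed in-memory state by applying, in log order,
// every record newer than the last sequence number known to be flushed.
inline std::unordered_map<std::string, int64_t> ReplayWal(
    const std::vector<WalRecord>& wal, uint64_t last_flushed_seq) {
  std::unordered_map<std::string, int64_t> state;
  uint64_t previous_seq = last_flushed_seq;
  for (const auto& record : wal) {
    if (record.seq <= last_flushed_seq) continue;  // already on disk
    assert(record.seq > previous_seq);             // log must be ordered
    previous_seq = record.seq;
    if (record.op == WalOp::Insert) {
      state[record.pk] = record.value;
    } else {
      state.erase(record.pk);
    }
  }
  return state;
}
```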

Write Path: Insert, Update, and Flush

All write operations acquire a std::lock_guard<std::mutex> on seg_mtx_ to ensure thread safety. The Insert method follows a coordinated write pattern:

  1. Primary Key Mapping: The document's primary key is registered in IDMap, which returns a monotonic global document ID from doc_id_allocator_.
  2. Forward Storage: Column values are appended to memory_store_ using MemForwardStore::insert.
  3. Scalar Indexing: Scalar fields are indexed through InsertScalar, which updates invert_indexers_ with inverted index entries.
  4. Vector Indexing: Vector fields are processed by InsertVector, which adds vectors to memory_vector_indexers_ (e.g., HNSW graphs).
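
The four steps above can be sketched as a single locked method. ToyWriter and its containers are hypothetical simplifications: a plain map stands in for IDMap, a vector for MemForwardStore, a map of posting lists for the inverted index, and a flat list of vectors for the vector indexer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

struct ToyDoc {
  std::string pk;
  int64_t user_id;
  std::vector<float> vec;
};

class ToyWriter {
 public:
  // Returns the global document ID assigned to the document.
  uint64_t Insert(const ToyDoc& doc) {
    std::lock_guard<std::mutex> lock(mtx_);     // one writer at a time
    const uint64_t doc_id = next_doc_id_++;     // 1. primary key mapping
    id_map_[doc.pk] = doc_id;
    forward_.push_back(doc.user_id);            // 2. forward storage
    inverted_[doc.user_id].push_back(doc_id);   // 3. scalar indexing
    vectors_.push_back(doc.vec);                // 4. vector indexing (flat)
    assert(forward_.size() == vectors_.size()); // stores stay row-aligned
    return doc_id;
  }

  std::vector<uint64_t> LookupByUserId(int64_t user_id) const {
    std::lock_guard<std::mutex> lock(mtx_);
    auto it = inverted_.find(user_id);
    if (it == inverted_.end()) return {};
    return it->second;
  }

 private:
  mutable std::mutex mtx_;
  uint64_t next_doc_id_ = 0;
  std::unordered_map<std::string, uint64_t> id_map_;
  std::vector<int64_t> forward_;
  std::map<int64_t, std::vector<uint64_t>> inverted_;
  std::vector<std::vector<float>> vectors_;
};
```

Holding one lock across all four steps keeps the forward store and both indexes consistent with each other, at the cost of serializing writers.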

When memory_store_ reaches the configured flush threshold, flush() persists the block: it writes the MemForwardStore to disk as a new BaseForwardStore subclass, creates a BlockMeta with type SCALAR, appends it to SegmentMeta::persisted_blocks_, and atomically updates the footer. A new BlockID is allocated from block_id_allocator_, and a fresh MemForwardStore becomes the active writing block.

Read Path: Fetch and Scan

Read operations leverage cached block metadata to minimize I/O. The Fetch method for single-document retrieval:

  1. Block Location: Uses find_persist_block_id and persist_block_offsets_ to map the global document ID to a specific block and local row offset.
  2. Data Retrieval: Delegates to the appropriate BaseForwardStore in persist_stores_ to read column values.
  3. Vector Reconstruction: If the document contains vector fields, retrieves the vector from the relevant VectorColumnIndexer in vector_indexers_ using the same block-local offset.
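
Step 1 can be sketched as a binary search over per-block starting document IDs. This is an assumption about how find_persist_block_id and persist_block_offsets_ might cooperate, not a description of the actual code; the Locate function and RowLocation type are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct RowLocation {
  std::size_t block_index;  // position in the list of persisted blocks
  uint64_t local_row;       // row offset inside that block
};

// block_start_ids[i] is the first global doc ID stored in block i
// (sorted ascending); end_doc_id is one past the last persisted doc ID.
inline std::optional<RowLocation> Locate(
    const std::vector<uint64_t>& block_start_ids, uint64_t end_doc_id,
    uint64_t doc_id) {
  assert(std::is_sorted(block_start_ids.begin(), block_start_ids.end()));
  if (block_start_ids.empty() || doc_id < block_start_ids.front() ||
      doc_id >= end_doc_id) {
    return std::nullopt;  // not in any persisted block
  }
  // First block that starts after doc_id; the one before it owns doc_id.
  auto it = std::upper_bound(block_start_ids.begin(), block_start_ids.end(),
                             doc_id);
  const std::size_t index =
      static_cast<std::size_t>(it - block_start_ids.begin()) - 1;
  return RowLocation{index, doc_id - block_start_ids[index]};
}
```

The same local_row can then be used against both the forward store and the vector indexer of that block, which is why step 3 reuses the block-local offset.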

Scan operations iterate across persist_stores_ and apply predicate pushdown through invert_indexers_ for scalar filters. Vector similarity search utilizes memory_vector_indexers_ for the active block and vector_indexers_ for persisted blocks, with optional routing to quant_vector_indexers_ when quantized search is enabled.
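
Searching the active block together with the persisted blocks amounts to collecting candidates from each and keeping the global top k. The sketch below uses brute-force squared L2 distance as a stand-in for HNSW or Flat indexers; FlatBlock, Hit, and SearchAllBlocks are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<float>;

struct Hit {
  uint64_t doc_id;
  float dist;
};

struct FlatBlock {
  uint64_t first_doc_id;     // global ID of row 0 in this block
  std::vector<Vec> vectors;
};

inline float SquaredL2(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Scores every vector in every block, then keeps the k closest overall.
inline std::vector<Hit> SearchAllBlocks(const std::vector<FlatBlock>& blocks,
                                        const Vec& query, std::size_t k) {
  std::vector<Hit> hits;
  for (const auto& block : blocks) {
    for (std::size_t row = 0; row < block.vectors.size(); ++row) {
      hits.push_back({block.first_doc_id + row,
                      SquaredL2(block.vectors[row], query)});
    }
  }
  const std::size_t keep = std::min(k, hits.size());
  std::partial_sort(hits.begin(),
                    hits.begin() + static_cast<std::ptrdiff_t>(keep),
                    hits.end(),
                    [](const Hit& a, const Hit& b) { return a.dist < b.dist; });
  hits.resize(keep);
  return hits;
}
```

A real implementation would ask each indexer for its own top k and merge those shorter lists, and would also drop IDs recorded in the delete store; both are omitted here for brevity.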

Code Example: Working with Segments

The following example demonstrates creating a segment, inserting a document with scalar and vector fields, and fetching the data back:

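#include <cstdint>    // int64_t
#include <iostream>   // std::cout
#include <memory>     // std::make_shared
#include <vector>     // std::vector

#include <zvec/db/schema.h>
#include <zvec/db/segment.h>
#include <zvec/db/id_map.h>
#include <zvec/db/delete_store.h>
#include <zvec/db/version_manager.h>
#include <zvec/db/options.h>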

using namespace zvec;

int main() {
  // 1️⃣ Build collection schema (simplified)
  auto schema = std::make_shared<CollectionSchema>();
  schema->add_forward_field(FieldSchema::MakeInt64("user_id"));
  schema->add_vector_field(FieldSchema::MakeFloatVector("vec", 128,
                           std::make_shared<VectorIndexParams>(IndexType::HNSW)));

  // 2️⃣ Helpers required by a segment
  auto id_map = std::make_shared<IDMap>();
  auto delete_store = std::make_shared<DeleteStore>();
  auto version_manager = std::make_shared<VersionManager>();
  SegmentOptions opts;   // default options

  // 3️⃣ Create a brand‑new segment on disk
  auto seg_res = Segment::CreateAndOpen(
      "./data/segment_0",          // path
      *schema,                     // collection schema
      0,                           // segment id
      0,                           // min_doc_id (global)
      id_map, delete_store,
      version_manager, opts);
  Segment::Ptr seg = seg_res.value();   // assume OK

  // 4️⃣ Insert a document
  Doc doc;
  doc.set_pk("doc_001");
  doc.set("user_id", int64_t(12345));
  std::vector<float> vec(128, 0.1f);   // dummy vector
  doc.set("vec", vec);
  seg->Insert(doc);                    // writes forward, scalar & vector indexes

  // 5️⃣ Fetch the same document by global id
  auto fetched = seg->Fetch(/*global_doc_id*/0);
  std::cout << "Fetched pk: " << fetched->pk() << std::endl;
}

Creating a segment: Segment::CreateAndOpen → writes an empty SCALAR forward block and a fresh SegmentMeta (see src/db/index/segment/segment.h).
Inserting: SegmentImpl::Insert → internal_insert → calls memory_store_->insert, insert_scalar_indexer, insert_vector_indexer.
Fetching: SegmentImpl::Fetch uses find_persist_block_id, persist_block_offsets_, and the appropriate vector indexer to rebuild the Doc object.

Key Source Files

File – Description
src/db/index/segment/segment.h – Abstract Segment interface and static factory methods (CreateAndOpen, Open).
src/db/index/segment/segment.cc – Concrete SegmentImpl implementation containing runtime state, read/write logic, and flush mechanics.
src/db/index/common/meta.h – BlockMeta and SegmentMeta definitions governing the metadata layout for blocks and segments.
src/include/zvec/db/type.h – BlockType enum (SCALAR, SCALAR_INDEX, VECTOR_INDEX, VECTOR_INDEX_QUANTIZE) and core type definitions.
src/db/index/segment/segment_manager.h – SegmentManager class managing the registry of all segments belonging to a collection.
src/include/zvec/core/framework/index_format.h – On-disk index format structures (IndexFormat::MetaHeader, MetaFooter, SegmentMetaBuffer).
src/db/index/segment/segment_manager.cc – Implementation of segment registry operations including add_segment and destroy_segment.
src/tools/core/meta_segment_common.h – Tag-list segment names used by the toolchain (e.g., local_taglists_header).
src/db/index/segment/segment_helper.cc – Helper functions for creating scalar and vector indexes during segment initialization.

Summary

  • A ZVec segment is a self-contained storage shard that manages forward data, scalar indexes, and vector indexes through a unified block-based architecture defined in src/db/index/segment/segment.h and src/db/index/common/meta.h.
  • Four distinct block types (SCALAR, SCALAR_INDEX, VECTOR_INDEX, VECTOR_INDEX_QUANTIZE) organize data physically on disk, with only SCALAR blocks accepting active writes while the segment is open.
  • SegmentMeta and BlockMeta provide the metadata structure that maps document IDs to physical block locations, so SegmentImpl can resolve a lookup from cached offset arrays instead of scanning the storage file.
  • SegmentImpl maintains complex runtime state including MemForwardStore for active writes, InvertedIndexer for scalar queries, and VectorColumnIndexer instances for approximate nearest neighbor search, all synchronized through seg_mtx_.
  • Lifecycle operations (CreateAndOpen, Insert, flush, Fetch) coordinate WAL durability, block allocation via block_id_allocator_, and metadata footer updates to ensure crash consistency.

Frequently Asked Questions

What is the difference between SegmentMeta and BlockMeta in ZVec?

SegmentMeta acts as the segment-level header containing the segment ID, a vector of all persisted blocks, the currently active writing forward block, and the set of indexed vector fields. BlockMeta describes individual data blocks, specifying the BlockType (SCALAR, SCALAR_INDEX, etc.), the document ID range contained within, and the list of columns stored. While SegmentMeta provides the directory structure for the entire segment, BlockMeta entries enable the segment to locate specific data blocks during read operations.

How does ZVec handle writes to different block types?

ZVec restricts active forward-data writes to SCALAR blocks, which store forward column values including global document IDs and primary keys. When documents are inserted via SegmentImpl::Insert, the data flows into MemForwardStore (the in-memory representation of the active SCALAR block), and the in-memory scalar and vector indexers are updated alongside it. The persisted SCALAR_INDEX, VECTOR_INDEX, and VECTOR_INDEX_QUANTIZE blocks are typically built during background compaction or when the active block is flushed to disk, rather than being written on the hot write path.

What happens during a segment flush operation?

When the in-memory forward store reaches its configured threshold, SegmentImpl::flush() executes an atomic persistence operation. The method writes the MemForwardStore contents to disk as a persisted BaseForwardStore subclass, creates a new BlockMeta with type SCALAR, and appends this metadata to SegmentMeta::persisted_blocks_. The segment footer is atomically updated to reflect the new block layout, a fresh BlockID is allocated from block_id_allocator_, and a new empty MemForwardStore becomes the active writing buffer. This process ensures that previously written data becomes immutable and read-optimized while new writes continue uninterrupted.
