# Internal Structure of ZVec Segments: A Deep Dive into Alibaba's Vector Database Storage

> Explore the internal structure of ZVec segments, the storage units of Alibaba's vector database. Understand how raw data, indexes, and metadata are managed in a block-based architecture.

- Repository: [Alibaba/zvec](https://github.com/alibaba/zvec)
- Tags: deep-dive
- Published: 2026-02-16

---

**A ZVec segment is a self-contained storage unit that bundles raw forward data, scalar indexes, vector indexes, and metadata into a block-based architecture managed by the `SegmentImpl` and `SegmentMeta` classes.**

The internal structure of ZVec segments forms the foundation of Alibaba's `zvec` vector database, determining how collections store, index, and retrieve high-dimensional data. Each segment operates as an independent shard within a collection, managing its own write-ahead log, version control, and block-based persistence layer. Understanding this architecture is essential for optimizing storage layout, tuning flush thresholds, and debugging performance bottlenecks in production deployments.

## Core Components of a ZVec Segment

### Segment Interface and Implementation

The segment architecture follows a clear separation between interface and implementation. The **`Segment`** interface in [`src/db/index/segment/segment.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment.h) defines the high-level API for creating, opening, inserting, fetching, and scanning documents. This abstract class provides static factory methods `CreateAndOpen` and `Open` that return `Segment::Ptr` instances.
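The interface/implementation split described above can be illustrated with a minimal sketch. Every name here (`SegmentSketch`, `SegmentSketchImpl`, the single-argument `CreateAndOpen`) is hypothetical and far simpler than the real declarations in `segment.h`; the point is only the pattern of an abstract class whose static factory returns a shared pointer to a hidden concrete type.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an abstract segment interface (not zvec's real API).
class SegmentSketch {
 public:
  using Ptr = std::shared_ptr<SegmentSketch>;
  virtual ~SegmentSketch() = default;

  virtual bool Insert(const std::string &pk) = 0;
  virtual std::size_t doc_count() const = 0;

  // Static factory: callers never name the concrete class.
  static Ptr CreateAndOpen(const std::string &path);
};

namespace {

// Concrete implementation, normally hidden inside the .cc file.
class SegmentSketchImpl : public SegmentSketch {
 public:
  explicit SegmentSketchImpl(std::string path) : path_(std::move(path)) {}

  bool Insert(const std::string &pk) override {
    if (pk.empty()) return false;  // reject documents without a primary key
    ++count_;
    return true;
  }

  std::size_t doc_count() const override { return count_; }

 private:
  std::string path_;
  std::size_t count_ = 0;
};

}  // namespace

SegmentSketch::Ptr SegmentSketch::CreateAndOpen(const std::string &path) {
  assert(!path.empty());
  return std::make_shared<SegmentSketchImpl>(path);
}
```

Keeping the concrete class out of the header means callers depend only on the interface, so the implementation's runtime state can change without touching them.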

The concrete implementation resides in **`SegmentImpl`** within `src/db/index/segment/segment.cc`. This class manages the runtime state required for fast writes and reads, including memory-mapped forward stores, inverted indexers, and vector column indexers. All write operations acquire a `std::lock_guard` on `seg_mtx_` to ensure thread safety during concurrent insertions and updates.

### Metadata Hierarchy: SegmentMeta and BlockMeta

Every segment maintains a metadata hierarchy defined in [`src/db/index/common/meta.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/common/meta.h). **`SegmentMeta`** serves as the segment-level header, storing the segment ID, a list of persisted blocks, the currently active writing forward block, and the set of indexed vector fields.

**`BlockMeta`** describes individual persisted blocks within a segment. Each block metadata entry tracks the block type, document ID range, and column list. These metadata structures enable the segment to locate specific data blocks during read operations without scanning the entire storage file.
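To make the hierarchy concrete, here is a simplified sketch of how block metadata can answer "which block holds this document?" without touching the data files. The struct layouts, field names, and the `FindBlock` helper are assumptions for illustration; the actual definitions live in `meta.h` and will differ.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced block categories for this sketch.
enum class BlockKind : uint32_t { Scalar = 1, ScalarIndex = 2, VectorIndex = 3 };

// Assumed shape of a per-block metadata entry.
struct BlockMetaSketch {
  uint32_t id;
  BlockKind type;
  uint64_t min_doc_id;  // inclusive
  uint64_t max_doc_id;  // inclusive
  std::vector<std::string> columns;
};

// Assumed shape of the segment-level metadata.
struct SegmentMetaSketch {
  uint32_t segment_id = 0;
  std::vector<BlockMetaSketch> persisted_blocks;

  void AddBlock(BlockMetaSketch block) {
    assert(block.min_doc_id <= block.max_doc_id);
    persisted_blocks.push_back(std::move(block));
  }

  // Returns the id of the block of the given kind whose range covers doc_id.
  std::optional<uint32_t> FindBlock(BlockKind type, uint64_t doc_id) const {
    for (const auto &block : persisted_blocks) {
      if (block.type == type && doc_id >= block.min_doc_id &&
          doc_id <= block.max_doc_id) {
        return block.id;
      }
    }
    return std::nullopt;
  }
};
```

The linear scan is acceptable here because a segment holds few blocks compared with the number of documents.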

### Block Types and Storage Classification

The **`BlockType`** enum in [`src/include/zvec/db/type.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/type.h) categorizes blocks into four distinct types based on their storage purpose:

```cpp
// src/include/zvec/db/type.h
enum BlockType : uint32_t {
    UNDEFINED = 0,
    SCALAR = 1,               // forward columns (global id, pk, user fields)
    SCALAR_INDEX = 2,         // inverted indexes for scalar fields
    VECTOR_INDEX = 3,         // vector column indexes (e.g. HNSW, Flat)
    VECTOR_INDEX_QUANTIZE = 4 // optional quantized vector indexes
};
```

**SCALAR** blocks store forward column values including global document IDs, primary keys, and user-defined fields. These are the only blocks that accept writes while a segment remains open. **SCALAR_INDEX** blocks contain persisted inverted indexes for scalar fields, while **VECTOR_INDEX** and **VECTOR_INDEX_QUANTIZE** blocks store high-dimensional vector indexes and their quantized variants respectively.
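The classification rules in the paragraph above can be written down as small helper predicates. The enum is copied from the listing; the helper functions `BlockTypeName`, `AcceptsWritesWhileOpen`, and `IsVectorBlock` are hypothetical and are not claimed to exist in zvec.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

enum BlockType : uint32_t {
  UNDEFINED = 0,
  SCALAR = 1,
  SCALAR_INDEX = 2,
  VECTOR_INDEX = 3,
  VECTOR_INDEX_QUANTIZE = 4
};

// Hypothetical helper: human-readable name for logging or debugging.
constexpr std::string_view BlockTypeName(BlockType type) {
  switch (type) {
    case SCALAR: return "SCALAR";
    case SCALAR_INDEX: return "SCALAR_INDEX";
    case VECTOR_INDEX: return "VECTOR_INDEX";
    case VECTOR_INDEX_QUANTIZE: return "VECTOR_INDEX_QUANTIZE";
    default: return "UNDEFINED";
  }
}

// Per the text, only forward (SCALAR) blocks accept writes while open.
constexpr bool AcceptsWritesWhileOpen(BlockType type) { return type == SCALAR; }

// Both plain and quantized vector indexes count as vector blocks.
constexpr bool IsVectorBlock(BlockType type) {
  return type == VECTOR_INDEX || type == VECTOR_INDEX_QUANTIZE;
}
```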

## Block-Based Storage Architecture

ZVec segments organize all persistent data into blocks that become immutable once persisted. When a segment is created, `Segment::CreateAndOpen` initializes an empty **SCALAR** forward block and writes the initial `SegmentMeta` to disk. As documents are inserted, the active writing block accumulates data in memory through `MemForwardStore`.

When the memory block reaches a configurable threshold, the `flush()` method writes the block to persistent storage, creates a `BlockMeta` entry with type `SCALAR`, and updates `SegmentMeta::persisted_blocks_`. The segment then allocates a new writing forward block with a fresh `BlockID` generated by `block_id_allocator_`.
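The threshold-triggered flush cycle can be sketched as follows. This is a toy model under stated assumptions: rows are single `int64_t` values, "disk" is a vector, the threshold is a row count, and all type and method names (`ForwardWriterSketch`, `PersistedBlock`, `Append`, `Flush`) are hypothetical. Only the overall shape (buffer, persist, record metadata, allocate a new block ID) follows the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for a persisted SCALAR block plus its metadata entry.
struct PersistedBlock {
  uint32_t block_id;
  uint64_t min_doc_id;  // inclusive
  uint64_t max_doc_id;  // inclusive
  std::vector<int64_t> rows;
};

class ForwardWriterSketch {
 public:
  explicit ForwardWriterSketch(std::size_t flush_threshold)
      : threshold_(flush_threshold),
        writing_block_id_(block_id_allocator_.fetch_add(1)) {
    assert(flush_threshold > 0);
  }

  // Buffers one row in memory and flushes once the threshold is reached.
  uint64_t Append(int64_t value) {
    const uint64_t doc_id = next_doc_id_++;
    memory_rows_.push_back(value);
    if (memory_rows_.size() >= threshold_) Flush();
    return doc_id;
  }

  // Persists the active block and starts a new one with a fresh block id.
  void Flush() {
    if (memory_rows_.empty()) return;
    const uint64_t max_id = next_doc_id_ - 1;
    const uint64_t min_id = max_id + 1 - memory_rows_.size();
    persisted_.push_back(PersistedBlock{writing_block_id_, min_id, max_id,
                                        std::move(memory_rows_)});
    memory_rows_.clear();  // reset the moved-from buffer to a known state
    writing_block_id_ = block_id_allocator_.fetch_add(1);
  }

  const std::vector<PersistedBlock> &persisted() const { return persisted_; }
  uint32_t writing_block_id() const { return writing_block_id_; }
  std::size_t buffered() const { return memory_rows_.size(); }

 private:
  std::size_t threshold_;
  std::atomic<uint32_t> block_id_allocator_{0};
  uint32_t writing_block_id_;
  uint64_t next_doc_id_ = 0;
  std::vector<int64_t> memory_rows_;
  std::vector<PersistedBlock> persisted_;
};
```

The sketch is not thread-safe; a real segment would serialize `Append` and `Flush` behind a mutex.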

Persisted scalar and vector index blocks follow a similar lifecycle but are typically built during background compaction rather than on the hot write path. `SegmentImpl` maintains separate collections for persisted forward stores (`persist_stores_`) and persisted vector indexes (`vector_indexers_` and `quant_vector_indexers_`), so an ANN search can fan out across multiple blocks.

## In-Memory Runtime State

### SegmentImpl Member Structure

The `SegmentImpl` class holds the runtime state that connects persisted blocks to active read and write operations. Key members include:

- **`MemForwardStore::Ptr memory_store_`** – In-memory buffer for the active writing block, receiving all forward column data during inserts.
- **`std::vector<BaseForwardStore::Ptr> persist_stores_`** – Read-only persisted forward stores loaded from `SCALAR` blocks on disk.
- **`InvertedIndexer::Ptr invert_indexers_`** – Scalar inverted indexes supporting range and equality queries on non-vector fields.
- **`std::unordered_map<std::string, VectorColumnIndexer::Ptr> memory_vector_indexers_`** – Active vector indexers for the writing block, supporting HNSW or Flat index types.
- **`std::unordered_map<std::string, std::vector<VectorColumnIndexer::Ptr>> vector_indexers_`** – Persisted vector indexers organized by field name across all blocks.
- **`std::unordered_map<std::string, std::vector<VectorColumnIndexer::Ptr>> quant_vector_indexers_`** – Optional quantized vector indexes for memory-efficient ANN search.
- **`IDMap::Ptr id_map_`** and **`DeleteStore::Ptr delete_store_`** – Primary key to global document ID mapping and soft-delete bookkeeping.
- **`VersionManager::Ptr version_manager_`** – Handles version-specific features such as memory-mapped I/O enablement.
- **`WalFilePtr wal_file_`** – Write-ahead log ensuring durability for unflushed operations.
- **`std::atomic<uint64_t> doc_id_allocator_`** and **`std::atomic<BlockID> block_id_allocator_`** – Monotonic ID generators for documents and blocks.

## Segment Lifecycle Operations

### Creation and Opening

Segments are instantiated through two primary factory methods defined in [`src/db/index/segment/segment.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment.h). **`Segment::CreateAndOpen`** initializes a new segment directory, allocates the first `BlockID`, creates an empty `MemForwardStore`, and writes the initial `SegmentMeta` footer to disk. This method is invoked when a collection needs to expand storage capacity beyond existing segments.

**`Segment::Open`** reconstructs a segment from existing disk state. It reads the `SegmentMeta` footer using `IndexFormat` structures from [`src/include/zvec/core/framework/index_format.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/core/framework/index_format.h), reloads all persisted forward stores into `persist_stores_`, and reconstructs scalar and vector indexers from their respective blocks. The method ensures crash recovery by replaying any unflushed WAL entries via `wal_file_`.
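The recovery step can be pictured with a small replay routine. This is a sketch under assumptions: the record layout (`WalRecordSketch`), the in-memory WAL vector, and the `first_unflushed_doc_id` checkpoint are all hypothetical and do not describe zvec's actual WAL format. It shows only the general idea of re-applying entries that no persisted block covers yet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical WAL record: one logged insert.
struct WalRecordSketch {
  uint64_t doc_id;
  std::string pk;
  int64_t user_id;
};

// Re-applies WAL records that are not yet covered by a persisted block.
// `first_unflushed_doc_id` is an assumed checkpoint: every document with a
// smaller id is taken to be durable already.
std::size_t ReplayWal(const std::vector<WalRecordSketch> &wal,
                      uint64_t first_unflushed_doc_id,
                      std::vector<WalRecordSketch> *memory_rows) {
  assert(memory_rows != nullptr);
  std::size_t replayed = 0;
  for (const auto &record : wal) {
    if (record.doc_id < first_unflushed_doc_id) continue;  // already on disk
    memory_rows->push_back(record);  // rebuild the in-memory writing block
    ++replayed;
  }
  return replayed;
}
```

Skipping records below the checkpoint makes replay idempotent with respect to data that was already flushed before the crash.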

### Write Path: Insert, Update, and Flush

All write operations acquire a `std::lock_guard<std::mutex>` on `seg_mtx_` to ensure thread safety. The **`Insert`** method follows a coordinated write pattern:

1. **Primary Key Mapping**: The document's primary key is registered in `IDMap`, which returns a monotonic global document ID from `doc_id_allocator_`.
2. **Forward Storage**: Column values are appended to `memory_store_` using `MemForwardStore::insert`.
3. **Scalar Indexing**: Scalar fields are indexed through `InsertScalar`, which updates `invert_indexers_` with inverted index entries.
4. **Vector Indexing**: Vector fields are processed by `InsertVector`, which adds vectors to `memory_vector_indexers_` (e.g., HNSW graphs).

When `memory_store_` reaches the configured flush threshold, **`flush()`** persists the block: it writes the `MemForwardStore` to disk as a new `BaseForwardStore` subclass, creates a `BlockMeta` with type `SCALAR`, appends it to `SegmentMeta::persisted_blocks_`, and atomically updates the footer. A new `BlockID` is allocated from `block_id_allocator_`, and a fresh `MemForwardStore` becomes the active writing block.
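The four insert steps can be condensed into a toy write path. Everything here is an illustrative assumption: the `DocSketch` layout, the duplicate-key rejection, the `std::map`-based inverted index, and the flat vector list standing in for an HNSW graph. Only the ordering of the steps and the use of a `std::lock_guard` follow the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical document with one scalar field and one vector field.
struct DocSketch {
  std::string pk;
  int64_t user_id;
  std::vector<float> vec;
};

class WritePathSketch {
 public:
  // Returns the assigned doc id, or nullopt if the pk already exists
  // (duplicate handling is an assumption of this sketch).
  std::optional<uint64_t> Insert(const DocSketch &doc) {
    std::lock_guard<std::mutex> lock(seg_mtx_);
    if (id_map_.count(doc.pk) != 0) return std::nullopt;
    const uint64_t doc_id = next_doc_id_++;
    id_map_.emplace(doc.pk, doc_id);           // 1. primary key mapping
    forward_rows_.push_back(doc.user_id);      // 2. forward storage
    inverted_[doc.user_id].push_back(doc_id);  // 3. scalar inverted index
    vectors_.push_back(doc.vec);               // 4. vector index (flat list)
    return doc_id;
  }

  // Inclusive range query served from the ordered inverted index.
  std::vector<uint64_t> LookupRange(int64_t lo, int64_t hi) const {
    std::lock_guard<std::mutex> lock(seg_mtx_);
    std::vector<uint64_t> result;
    for (auto it = inverted_.lower_bound(lo);
         it != inverted_.end() && it->first <= hi; ++it) {
      result.insert(result.end(), it->second.begin(), it->second.end());
    }
    return result;
  }

 private:
  mutable std::mutex seg_mtx_;
  uint64_t next_doc_id_ = 0;
  std::unordered_map<std::string, uint64_t> id_map_;
  std::vector<int64_t> forward_rows_;
  std::map<int64_t, std::vector<uint64_t>> inverted_;
  std::vector<std::vector<float>> vectors_;
};
```

Holding one lock across all four steps keeps the forward store and both indexes consistent with each other, at the cost of serializing writers.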

### Read Path: Fetch and Scan

Read operations leverage cached block metadata to minimize I/O. The **`Fetch`** method for single-document retrieval:

1. **Block Location**: Uses `find_persist_block_id` and `persist_block_offsets_` to map the global document ID to a specific block and local row offset.
2. **Data Retrieval**: Delegates to the appropriate `BaseForwardStore` in `persist_stores_` to read column values.
3. **Vector Reconstruction**: If the document contains vector fields, retrieves the vector from the relevant `VectorColumnIndexer` in `vector_indexers_` using the same block-local offset.
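The block-location step can be sketched as a binary search over cached block start offsets. This is an assumption about how such a lookup might work, not a description of `find_persist_block_id` itself; the function `LocateDoc`, the `BlockLocation` struct, and the exclusive `end_doc_id` bound are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Result of mapping a global doc id to a block and a row inside it.
struct BlockLocation {
  std::size_t block_index;
  uint64_t local_row;
};

// block_offsets[i] is the first global doc id stored in persisted block i.
// end_doc_id is one past the last persisted doc id.
std::optional<BlockLocation> LocateDoc(
    const std::vector<uint64_t> &block_offsets, uint64_t end_doc_id,
    uint64_t doc_id) {
  assert(std::is_sorted(block_offsets.begin(), block_offsets.end()));
  if (block_offsets.empty() || doc_id < block_offsets.front() ||
      doc_id >= end_doc_id) {
    return std::nullopt;  // not in any persisted block
  }
  // First offset strictly greater than doc_id; the block before it owns doc_id.
  const auto it =
      std::upper_bound(block_offsets.begin(), block_offsets.end(), doc_id);
  const std::size_t index =
      static_cast<std::size_t>(it - block_offsets.begin()) - 1;
  return BlockLocation{index, doc_id - block_offsets[index]};
}
```

With offsets `{0, 100, 250}` and `end_doc_id` 300, document 299 resolves to block 2 at local row 49. The search costs O(log B) for B blocks.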

**Scan** operations iterate across `persist_stores_` and apply predicate pushdown through `invert_indexers_` for scalar filters. Vector similarity search utilizes `memory_vector_indexers_` for the active block and `vector_indexers_` for persisted blocks, with optional routing to `quant_vector_indexers_` when quantized search is enabled.
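Searching the active block and every persisted block implies merging per-block results into one global top-k list. The sketch below shows one way to do that, using brute-force squared L2 distance in place of HNSW; `FlatBlockSketch`, `Hit`, and `SearchAllBlocks` are hypothetical names, and the merge strategy is an assumption rather than zvec's confirmed behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Hit {
  float distance;  // squared L2 distance to the query
  uint64_t doc_id;
};

inline bool CloserFirst(const Hit &a, const Hit &b) {
  return a.distance < b.distance;
}

// Keeps only the k nearest hits, sorted by ascending distance.
inline void KeepTopK(std::vector<Hit> *hits, std::size_t k) {
  const std::size_t n = std::min(k, hits->size());
  std::partial_sort(hits->begin(), hits->begin() + n, hits->end(), CloserFirst);
  hits->resize(n);
}

// Stand-in for one per-block vector indexer (brute force instead of HNSW).
struct FlatBlockSketch {
  uint64_t first_doc_id;
  std::vector<std::vector<float>> vectors;

  std::vector<Hit> Search(const std::vector<float> &query,
                          std::size_t k) const {
    std::vector<Hit> hits;
    for (std::size_t i = 0; i < vectors.size(); ++i) {
      assert(vectors[i].size() == query.size());
      float dist = 0.0f;
      for (std::size_t j = 0; j < query.size(); ++j) {
        const float diff = vectors[i][j] - query[j];
        dist += diff * diff;
      }
      hits.push_back(Hit{dist, first_doc_id + static_cast<uint64_t>(i)});
    }
    KeepTopK(&hits, k);
    return hits;
  }
};

// Queries every block for its local top-k, then merges into a global top-k.
std::vector<Hit> SearchAllBlocks(const std::vector<FlatBlockSketch> &blocks,
                                 const std::vector<float> &query,
                                 std::size_t k) {
  std::vector<Hit> merged;
  for (const auto &block : blocks) {
    const std::vector<Hit> local = block.Search(query, k);
    merged.insert(merged.end(), local.begin(), local.end());
  }
  KeepTopK(&merged, k);
  return merged;
}
```

Asking each block for k candidates is sufficient for an exact global top-k here, because any global winner must also be among the k nearest within its own block.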

## Code Example: Working with Segments

The following example demonstrates creating a segment, inserting a document with scalar and vector fields, and fetching the data back:

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

#include <zvec/db/schema.h>
#include <zvec/db/segment.h>
#include <zvec/db/id_map.h>
#include <zvec/db/delete_store.h>
#include <zvec/db/version_manager.h>
#include <zvec/db/options.h>

using namespace zvec;

int main() {
  // 1️⃣ Build collection schema (simplified)
  auto schema = std::make_shared<CollectionSchema>();
  schema->add_forward_field(FieldSchema::MakeInt64("user_id"));
  schema->add_vector_field(FieldSchema::MakeFloatVector("vec", 128,
                           std::make_shared<VectorIndexParams>(IndexType::HNSW)));

  // 2️⃣ Helpers required by a segment
  auto id_map = std::make_shared<IDMap>();
  auto delete_store = std::make_shared<DeleteStore>();
  auto version_manager = std::make_shared<VersionManager>();
  SegmentOptions opts;   // default options

  // 3️⃣ Create a brand‑new segment on disk
  auto seg_res = Segment::CreateAndOpen(
      "./data/segment_0",          // path
      *schema,                     // collection schema
      0,                           // segment id
      0,                           // min_doc_id (global)
      id_map, delete_store,
      version_manager, opts);
  Segment::Ptr seg = seg_res.value();   // assume OK

  // 4️⃣ Insert a document
  Doc doc;
  doc.set_pk("doc_001");
  doc.set("user_id", int64_t(12345));
  std::vector<float> vec(128, 0.1f);   // dummy vector
  doc.set("vec", vec);
  seg->Insert(doc);                    // writes forward, scalar & vector indexes

  // 5️⃣ Fetch the same document by global id
  auto fetched = seg->Fetch(/*global_doc_id*/0);
  std::cout << "Fetched pk: " << fetched->pk() << std::endl;
}
```

*Creating a segment*: `Segment::CreateAndOpen` → writes an empty `SCALAR` forward block and a fresh `SegmentMeta` (see [`src/db/index/segment/segment.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment.h)).  
*Inserting*: `SegmentImpl::Insert` → `internal_insert` → calls `memory_store_->insert`, `insert_scalar_indexer`, `insert_vector_indexer`.  
*Fetching*: `SegmentImpl::Fetch` uses `find_persist_block_id`, `persist_block_offsets_`, and the appropriate vector indexer to rebuild the `Doc` object.

## Key Source Files

| File | Description |
|------|-------------|
| **[`src/db/index/segment/segment.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment.h)** | Abstract `Segment` interface and static factory methods (`CreateAndOpen`, `Open`). |
| **`src/db/index/segment/segment.cc`** | Concrete `SegmentImpl` implementation containing runtime state, read/write logic, and flush mechanics. |
| **[`src/db/index/common/meta.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/common/meta.h)** | `BlockMeta` and `SegmentMeta` definitions governing the metadata layout for blocks and segments. |
| **[`src/include/zvec/db/type.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/type.h)** | `BlockType` enum (`SCALAR`, `SCALAR_INDEX`, `VECTOR_INDEX`, `VECTOR_INDEX_QUANTIZE`) and core type definitions. |
| **[`src/db/index/segment/segment_manager.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment_manager.h)** | `SegmentManager` class managing the registry of all segments belonging to a collection. |
| **[`src/include/zvec/core/framework/index_format.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/core/framework/index_format.h)** | On-disk index format structures (`IndexFormat::MetaHeader`, `MetaFooter`, `SegmentMetaBuffer`). |
| **`src/db/index/segment/segment_manager.cc`** | Implementation of segment registry operations including `add_segment` and `destroy_segment`. |
| **[`src/tools/core/meta_segment_common.h`](https://github.com/alibaba/zvec/blob/main/src/tools/core/meta_segment_common.h)** | Tag-list segment names used by the toolchain (e.g., `local_taglists_header`). |
| **`src/db/index/segment/segment_helper.cc`** | Helper functions for creating scalar and vector indexes during segment initialization. |

## Summary

- **A ZVec segment is a self-contained storage shard** that manages forward data, scalar indexes, and vector indexes through a unified block-based architecture defined in [`src/db/index/segment/segment.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/segment/segment.h) and [`src/db/index/common/meta.h`](https://github.com/alibaba/zvec/blob/main/src/db/index/common/meta.h).
- **Four distinct block types** (`SCALAR`, `SCALAR_INDEX`, `VECTOR_INDEX`, `VECTOR_INDEX_QUANTIZE`) organize data physically on disk, with only `SCALAR` blocks accepting active writes while the segment is open.
- **SegmentMeta and BlockMeta** provide the metadata structure that maps document IDs to physical block locations, and `SegmentImpl` caches block offsets (`persist_block_offsets_`) so that lookups avoid scanning the storage file.
- **SegmentImpl maintains complex runtime state** including `MemForwardStore` for active writes, `InvertedIndexer` for scalar queries, and `VectorColumnIndexer` instances for approximate nearest neighbor search, all synchronized through `seg_mtx_`.
- **Lifecycle operations** (`CreateAndOpen`, `Insert`, `flush`, `Fetch`) coordinate WAL durability, block allocation via `block_id_allocator_`, and metadata footer updates to ensure crash consistency.

## Frequently Asked Questions

### What is the difference between SegmentMeta and BlockMeta in ZVec?

**`SegmentMeta`** acts as the segment-level header containing the segment ID, a vector of all persisted blocks, the currently active writing forward block, and the set of indexed vector fields. **`BlockMeta`** describes individual data blocks, specifying the `BlockType` (SCALAR, SCALAR_INDEX, etc.), the document ID range contained within, and the list of columns stored. While `SegmentMeta` provides the directory structure for the entire segment, `BlockMeta` entries enable the segment to locate specific data blocks during read operations.

### How does ZVec handle writes to different block types?

ZVec restricts active block writes to **SCALAR blocks only**, which store forward column values including global document IDs and primary keys. When documents are inserted via `SegmentImpl::Insert`, the data flows into `MemForwardStore` (the in-memory representation of the active SCALAR block), and the in-memory indexers (`invert_indexers_` and `memory_vector_indexers_`) are updated alongside it. The persisted index blocks (`SCALAR_INDEX`, `VECTOR_INDEX`, `VECTOR_INDEX_QUANTIZE`) are typically built during background compaction or when the active block is flushed to disk, rather than being written on the hot write path.

### What happens during a segment flush operation?

When the in-memory forward store reaches its configured threshold, `SegmentImpl::flush()` executes an atomic persistence operation. The method writes the `MemForwardStore` contents to disk as a persisted `BaseForwardStore` subclass, creates a new `BlockMeta` with type `SCALAR`, and appends this metadata to `SegmentMeta::persisted_blocks_`. The segment footer is atomically updated to reflect the new block layout, a fresh `BlockID` is allocated from `block_id_allocator_`, and a new empty `MemForwardStore` becomes the active writing buffer. This process ensures that previously written data becomes immutable and read-optimized while new writes continue uninterrupted.