How ZVec Handles Schema Evolution: AddColumn and AlterColumn Implementation

ZVec treats collection schemas as versioned, immutable objects. Both AddColumn and AlterColumn create a new schema version, flush the current writing segment, apply the change to every persisted segment in parallel, and only then atomically update the version manifest.

ZVec, Alibaba's high-performance vector database, implements robust schema evolution mechanisms that allow users to modify collection structures without downtime. Understanding how add_column and alter_column work internally reveals the system's transactional guarantees and Arrow-based execution engine. This article examines the implementation details found in the alibaba/zvec repository, including specific file paths and function signatures that power these schema modifications.

Understanding ZVec Schema Evolution

ZVec approaches schema evolution through an immutable versioning strategy. When you invoke AddColumn or AlterColumn, the system does not modify the existing schema in place. Instead, it follows a strict pipeline:

  1. Validation at the collection level checks type compatibility, nullability constraints, and expression syntax.
  2. Schema cloning creates a new CollectionSchema object incorporating the requested changes.
  3. Segment flushing ensures the current writing segment (if it contains data) is persisted and sealed.
  4. Parallel propagation applies the schema change to every persisted segment via SegmentManager.
  5. Atomic version update commits the new schema to the VersionManager manifest, making it visible to all readers.

Both operations share this pipeline, differing only in how column data is generated and validated.
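The five steps can be sketched with a small in-memory model. This is an illustration only, not ZVec code: Schema, Segment, ToyCollection, and add_null_column are hypothetical names, segments are plain dictionaries, and propagation runs sequentially for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Immutable schema version: a number plus (name, type, nullable) triples."""
    version: int
    columns: tuple


@dataclass
class Segment:
    """Toy segment: column name -> list of values."""
    data: dict
    sealed: bool = True


class ToyCollection:
    def __init__(self, schema, segments, writing):
        self.schema = schema        # currently visible schema version
        self.segments = segments    # persisted (sealed) segments
        self.writing = writing      # current writing segment

    def add_null_column(self, name, type_name, nullable=True):
        # 1. Validate at the collection level.
        if any(c[0] == name for c in self.schema.columns):
            raise ValueError(f"column {name!r} already exists")
        if not nullable:
            raise ValueError("a null-filled column must be nullable")

        # 2. Clone the schema instead of mutating it in place.
        new_schema = Schema(self.schema.version + 1,
                            self.schema.columns + ((name, type_name, nullable),))

        # 3. Flush and seal the writing segment if it holds data.
        if any(self.writing.data.values()):
            self.writing.sealed = True
            self.segments.append(self.writing)
        self.writing = Segment({c[0]: [] for c in new_schema.columns},
                               sealed=False)

        # 4. Propagate the change to every persisted segment.
        for seg in self.segments:
            rows = len(next(iter(seg.data.values()), []))
            seg.data[name] = [None] * rows

        # 5. Publish the new schema version in a single assignment.
        self.schema = new_schema
        return new_schema
```

Because Schema is frozen, any reader still holding the old version keeps a valid, unchanged object. The toy does not model failure handling between steps 4 and 5.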

The AddColumn Operation

The AddColumn operation in ZVec supports two distinct modes: creating a null-filled column or generating a computed column through expression evaluation. The entry point resides in src/include/zvec/db/collection.h, with the core implementation in src/db/collection.cc.

Null Columns vs Computed Columns

When adding a column, ZVec determines the data source based on whether an expression is provided:

  • Null column: If no expression is provided (empty string), ZVec creates a column filled with null values. This requires the field to be marked as nullable in the schema.
  • Computed column: If a SQL-like expression is provided (e.g., "int_score + float_score"), ZVec evaluates this expression against existing scalar columns using Arrow's computation engine.
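The two modes can be illustrated with a minimal pure-Python sketch. It is not how ZVec evaluates expressions: ZVec delegates to Arrow, whereas this toy walks a restricted Python AST one row at a time. The names build_column and _eval are hypothetical, and only +, -, *, / over column names and numeric literals are supported.

```python
import ast
import operator

# Only these binary operators are accepted by the toy evaluator.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node, row):
    """Evaluate a restricted arithmetic AST against one row (a dict)."""
    if isinstance(node, ast.Expression):
        return _eval(node.body, row)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, row),
                                   _eval(node.right, row))
    if isinstance(node, ast.Name):
        return row[node.id]          # column reference
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value            # numeric literal
    raise ValueError(f"unsupported expression element: {ast.dump(node)}")


def build_column(columns, expression, nullable):
    """Return the values of a new column computed over existing columns."""
    row_count = len(next(iter(columns.values())))
    if expression == "":
        # Null mode: only legal when the field is declared nullable.
        if not nullable:
            raise ValueError("null-filled column requires a nullable field")
        return [None] * row_count
    # Computed mode: parse once, then evaluate for every row.
    tree = ast.parse(expression, mode="eval")
    names = list(columns)
    return [_eval(tree, {n: columns[n][i] for n in names})
            for i in range(row_count)]
```

For example, build_column({"int_score": [1, 2], "float_score": [0.5, 1.5]}, "int_score + float_score", nullable=False) yields one sum per row, while an empty expression with nullable=False is rejected.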

Implementation in SegmentImpl

The SegmentImpl::add_column method in src/db/index/segment/segment.cc (lines 36-84 and 94-131) handles the physical column creation:

Status SegmentImpl::add_column(FieldSchema::Ptr column_schema,
                               const std::string &expression,
                               const AddColumnOptions & /*options*/) {
  // Reject in-memory segments (cannot rewrite on-disk blocks)
  if (memory_store_) {
    return Status::NotSupported(
        "Add column is not supported for segment with memory store");
  }

  // Convert collection schema to Arrow fields
  std::vector<std::shared_ptr<arrow::Field>> fields;
  ConvertCollectionSchemaToArrowFields(collection_schema_, &fields);
  auto physic_schema = std::make_shared<arrow::Schema>(fields);

  // Prepare Arrow field for new column
  std::shared_ptr<arrow::Field> arrow_field;
  ConvertFieldSchemaToArrowField(column_schema.get(), &arrow_field);

  // Build new column data
  std::shared_ptr<arrow::ChunkedArray> new_column;
  if (expression.empty()) {
    // Create null column
    arrow::Result<std::shared_ptr<arrow::Array>> result =
        arrow::MakeArrayOfNull(arrow_field->type(),
                               scalar_blocks[0].doc_count_);
    new_column = std::make_shared<arrow::ChunkedArray>(std::vector{
        result.ValueOrDie()});
  } else {
    // Evaluate expression over existing columns.
    // Abridged excerpt: `expr` below is presumably the parsed expression
    // unwrapped from p_result; the ReadBlocksAsDataset arguments are elided.
    auto p_result = ParseToExpression(expression, physic_schema);
    auto dataset = ReadBlocksAsDataset(...);
    auto eval_result = EvaluateExpressionWithDataset(dataset,
                         column_schema->name(), expr, arrow_field->type());
    new_column = eval_result.ValueOrDie()->column(0);
  }

  // Write column into new scalar blocks
  WriteColumnInBlocks(column_schema->name(), new_column,
                      filter_column_blocks, path_, segment_meta_->id(),
                      [this]() { return allocate_block_id(); },
                      !options_.enable_mmap_, &new_blocks);

  // Rebuild inverted index if needed
  if (column_schema->has_invert_index()) {
    reopen_invert_indexer();
    invert_indexers_->create_column_indexer(*column_schema);
  }

  // Update segment metadata
  segment_meta_->add_column_meta(column_schema->name(),
                                 column_schema->has_invert_index(),
                                 /* other meta fields */);
  return Status::OK();
}

The AlterColumn Operation

While AddColumn creates new fields, AlterColumn modifies existing ones through the CollectionImpl::AlterColumn method in src/db/collection.cc. This operation supports renaming columns or changing their definitions (type, index settings, etc.).

Renaming and Redefining Fields

The SegmentImpl::alter_column implementation in src/db/index/segment/segment.cc (lines 216-226) handles both scenarios:

Status SegmentImpl::alter_column(const std::string &column_name,
                                 const FieldSchema::Ptr &new_column_schema,
                                 const AlterColumnOptions & /*options*/) {
  // Validation omitted for brevity
  
  // 1. Replace Arrow field definition in the collection schema
  // 2. Rewrite every scalar block that contains the column
  //    (similar WriteColumnInBlocks flow as add_column)
  // 3. Re-build inverted index if needed
  // 4. Update segment meta information
  return Status::OK();
}

Unlike AddColumn, which generates new data, AlterColumn preserves existing data while rewriting blocks to reflect new metadata (such as field names or index configurations).
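The idea of rewriting blocks while keeping their values can be sketched for the rename case. This is a toy model, not ZVec's block format: each block is a plain dictionary of column name to values, and rename_column_in_blocks is a hypothetical helper.

```python
def rename_column_in_blocks(blocks, old_name, new_name):
    """Return rewritten copies of blocks with old_name stored under new_name."""
    rewritten = []
    for block in blocks:
        if old_name not in block:
            rewritten.append(block)   # blocks without the column are reused
            continue
        if new_name in block:
            raise ValueError(f"column {new_name!r} already exists")
        # Preserve column order and values; only the key changes.
        rewritten.append({(new_name if k == old_name else k): v
                          for k, v in block.items()})
    return rewritten
```

The input blocks are never mutated, so the old blocks remain readable until the caller decides to swap in the rewritten ones.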

Transactional Guarantees and Version Management

ZVec ensures atomic schema evolution through its VersionManager. After SegmentManager parallelizes the operation across all segments, CollectionImpl constructs a new version:

Version new_version = version_manager_->get_current_version();
new_version.set_schema(*new_schema);
new_version.reset_writing_segment_meta(writing_segment_->meta());

for (auto meta : segment_manager_->get_segments_meta()) {
    new_version.update_persisted_segment_meta(meta);
}

version_manager_->apply(new_version);
version_manager_->flush();

This sequence ensures that:

  • Readers always see a consistent schema version
  • Partial failures leave the previous version intact
  • The writing segment is recreated with the new schema before accepting new data
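One common way to get this "old version or new version, never half of each" property is to write the new manifest to a temporary file and rename it over the old one. The sketch below shows that general technique; whether ZVec's VersionManager flushes its manifest this way is an assumption, and commit_version, load_version, and the JSON layout are hypothetical.

```python
import json
import os
import tempfile


def commit_version(manifest_path, new_version):
    """Persist new_version so readers see either the old or the new manifest."""
    directory = os.path.dirname(os.path.abspath(manifest_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(new_version, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # make the bytes durable before the swap
        # Rename over the old manifest; this is atomic on POSIX file systems.
        os.replace(tmp_path, manifest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)      # a failed commit leaves the old manifest
        raise


def load_version(manifest_path):
    """Read the currently committed version."""
    with open(manifest_path) as f:
        return json.load(f)
```

If serialization or the write fails, the temporary file is discarded and the previously committed manifest is untouched, which mirrors the "partial failures leave the previous version intact" guarantee.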

The SegmentManager parallelizes segment updates with a configurable concurrency level, which defaults to the number of hardware threads and is controlled via AddColumnOptions, defined in src/include/zvec/db/options.h.
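A fan-out of this kind can be sketched with a thread pool. This is a generic illustration, not SegmentManager's implementation: apply_to_segments and its concurrency parameter are hypothetical, and the default of os.cpu_count() stands in for "hardware threads".

```python
import os
from concurrent.futures import ThreadPoolExecutor


def apply_to_segments(segments, change, concurrency=None):
    """Run change(segment) on every segment in parallel.

    Returns the per-segment results in input order. If any worker raises,
    the exception propagates to the caller after the pool has drained.
    """
    workers = concurrency or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(change, seg) for seg in segments]
        # result() re-raises any exception thrown inside a worker.
        return [f.result() for f in futures]
```

A caller would publish the new version only when this function returns normally, so a failure in any single segment prevents the version update.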

Practical Code Examples

Adding a Nullable Column

import zvec

# Open existing collection
col = zvec.Collection.open("/tmp/my_collection")

# Define nullable int32 field
field = zvec.FieldSchema(
    name="age",
    data_type=zvec.DataType.INT32,
    nullable=True,
    invert_index=False
)

# Add null-filled column (no expression)
status = col.add_column(field, expression="")
assert status.ok()

Adding a Computed Column

# Define computed field
field = zvec.FieldSchema(
    name="score_sum",
    data_type=zvec.DataType.FLOAT,
    nullable=False,
    invert_index=False
)

# Expression referencing existing columns
expr = "int_score + float_score"
status = col.add_column(field, expression=expr)
assert status.ok()

Renaming a Column

# Rename "age" to "user_age"
status = col.alter_column(column_name="age", rename="user_age")
assert status.ok()

Modifying Index Settings

# Add inverted index to existing "title" column
new_schema = zvec.FieldSchema(
    name="title",
    data_type=zvec.DataType.STRING,
    nullable=False,
    invert_index=True  # Enable the inverted index
)

status = col.alter_column(column_name="title",
                          new_column_schema=new_schema)
assert status.ok()

Summary

  • Immutable versioning ensures ZVec never modifies schemas in place; every evolution creates a new schema version applied atomically via VersionManager.
  • AddColumn supports both null-filled columns and computed columns generated through Arrow expression evaluation over existing data.
  • AlterColumn handles renaming and field redefinition by rewriting scalar blocks while preserving existing data, optionally rebuilding inverted indexes.
  • Parallel execution across segments uses configurable concurrency in SegmentManager, ensuring efficient schema evolution on large collections.
  • Transactional safety guarantees that readers see consistent schemas and partial failures roll back to the previous valid version.

Frequently Asked Questions

How does ZVec ensure data consistency during schema evolution?

ZVec implements a transactional versioning system where schema changes are applied to a new Version object only after all segments successfully process the update. The VersionManager atomically switches to the new version and flushes the manifest to disk, ensuring readers always access a consistent schema state even if the operation fails midway.

Can I add a column with calculated values from existing fields?

Yes. ZVec's AddColumn operation accepts a SQL-like expression parameter that references existing scalar columns. The system parses this expression using ParseToExpression, builds a temporary Arrow dataset from existing blocks via ReadBlocksAsDataset, and evaluates the expression over that dataset through EvaluateExpressionWithDataset to materialize the new column data.

What happens to existing data when I alter a column's definition?

When using AlterColumn to rename a field or change its properties, ZVec rewrites the scalar blocks containing that column while preserving the actual data values. The operation updates the Arrow field definition, writes new blocks via WriteColumnInBlocks, and optionally rebuilds inverted indexes if the index configuration changed, ensuring existing documents remain accessible under the new schema.
