# How ZVec Handles Schema Evolution: AddColumn and AlterColumn Implementation

> Learn how ZVec manages schema evolution using AddColumn and AlterColumn. Discover its versioned immutable schema approach and efficient parallel propagation for seamless updates.

- Repository: [Alibaba/zvec](https://github.com/alibaba/zvec)
- Tags: internals
- Published: 2026-02-16

---

**ZVec handles schema evolution by treating collection schemas as versioned, immutable objects. Both AddColumn and AlterColumn create a new schema version, flush the current writing segment, propagate the change to all persisted segments in parallel, and then atomically update the version manifest.**

ZVec, Alibaba's high-performance vector database, implements robust **schema evolution** mechanisms that allow users to modify collection structures without downtime. Understanding how `add_column` and `alter_column` work internally reveals the system's transactional guarantees and Arrow-based execution engine. This article examines the implementation details found in the `alibaba/zvec` repository, including specific file paths and function signatures that power these schema modifications.

## Understanding ZVec Schema Evolution

ZVec approaches **schema evolution** through an immutable versioning strategy. When you invoke `AddColumn` or `AlterColumn`, the system does not modify the existing schema in place. Instead, it follows a strict pipeline:

1. **Validation** at the collection level checks type compatibility, nullability constraints, and expression syntax.
2. **Schema cloning** creates a new `CollectionSchema` object incorporating the requested changes.
3. **Segment flushing** ensures the current writing segment (if it contains data) is persisted and sealed.
4. **Parallel propagation** applies the schema change to every persisted segment via `SegmentManager`.
5. **Atomic version update** commits the new schema to the `VersionManager` manifest, making it visible to all readers.

Both operations share this pipeline, differing only in how column data is generated and validated.
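
The five steps can be illustrated with a toy model. This is a minimal sketch, not ZVec code: `Schema`, `Segment`, and `Collection` are hypothetical stand-ins, segments are plain in-memory dictionaries, and propagation runs sequentially for brevity.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Schema:
    """Immutable schema: a version number plus ordered (name, type) pairs."""
    version: int
    fields: Tuple[Tuple[str, str], ...]


@dataclass
class Segment:
    """Toy persisted segment holding columnar data."""
    columns: Dict[str, list]
    doc_count: int


@dataclass
class Collection:
    schema: Schema
    segments: List[Segment] = field(default_factory=list)
    writing_rows: List[dict] = field(default_factory=list)

    def add_column(self, name: str, dtype: str) -> Schema:
        # 1. Validation: reject duplicate names before touching any data.
        if any(n == name for n, _ in self.schema.fields):
            raise ValueError(f"column {name!r} already exists")

        # 2. Schema cloning: build a new object, leave the old one untouched.
        new_schema = replace(
            self.schema,
            version=self.schema.version + 1,
            fields=self.schema.fields + ((name, dtype),),
        )

        # 3. Segment flushing: seal the writing segment if it holds rows.
        if self.writing_rows:
            cols = {n: [r.get(n) for r in self.writing_rows]
                    for n, _ in self.schema.fields}
            self.segments.append(Segment(cols, len(self.writing_rows)))
            self.writing_rows = []

        # 4. Propagation: every persisted segment gains the new column
        #    (sequential here; the article describes this step as parallel).
        for seg in self.segments:
            seg.columns[name] = [None] * seg.doc_count

        # 5. Version update: publish the new schema in a single assignment.
        self.schema = new_schema
        return new_schema
```

Because `Schema` is frozen, any reader still holding the old object keeps seeing the old field list; only the final assignment makes the new version visible.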

## The AddColumn Operation

The `AddColumn` operation in ZVec supports two distinct modes: creating a null-filled column or generating a computed column through expression evaluation. The entry point resides in [`src/include/zvec/db/collection.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/collection.h), with the core implementation in `src/db/collection.cc`.

### Null Columns vs Computed Columns

When adding a column, ZVec determines the data source based on whether an expression is provided:

- **Null column**: If no expression is provided (empty string), ZVec creates a column filled with null values. This requires the field to be marked as nullable in the schema.
- **Computed column**: If a SQL-like expression is provided (e.g., `"int_score + float_score"`), ZVec evaluates this expression against existing scalar columns using Arrow's computation engine.
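
The choice between the two modes can be sketched in plain Python. This is an illustration under simplifying assumptions: `build_column` is a hypothetical helper, the expression grammar is limited to `+`, `-`, `*`, `/` over column names and numeric literals, and evaluation happens one row at a time in Python rather than through Arrow's compute kernels.

```python
import ast
import operator
from typing import Dict, List, Optional

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval_node(node: ast.AST, row: Dict[str, float]) -> float:
    """Evaluate a small arithmetic AST against one row of column values."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left, row)
        right = _eval_node(node.right, row)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.Name):
        return row[node.id]  # reference to an existing column
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError(f"unsupported expression element: {ast.dump(node)}")


def build_column(columns: Dict[str, List[float]], doc_count: int,
                 expression: str, nullable: bool) -> List[Optional[float]]:
    """Return data for a new column: all nulls, or one computed value per row."""
    if not expression:
        # Null mode: only valid when the field accepts nulls.
        if not nullable:
            raise ValueError("a null-filled column requires a nullable field")
        return [None] * doc_count
    # Computed mode: parse once, then evaluate against every row.
    tree = ast.parse(expression, mode="eval").body
    return [
        _eval_node(tree, {name: values[i] for name, values in columns.items()})
        for i in range(doc_count)
    ]
```

With columns `int_score = [1, 2, 3]` and `float_score = [0.5, 1.5, 2.5]`, an empty expression yields three nulls, while `"int_score + float_score"` yields `[1.5, 3.5, 5.5]`.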

### Implementation in SegmentImpl

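The `SegmentImpl::add_column` method in `src/db/index/segment/segment.cc` (lines 36-84 and 94-131) handles the physical column creation. The listing is abridged: names such as `scalar_blocks`, `filter_column_blocks`, `new_blocks`, and `expr` (presumably the expression unwrapped from `p_result`) are defined in code omitted here.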

```cpp
Status SegmentImpl::add_column(FieldSchema::Ptr column_schema,
                               const std::string &expression,
                               const AddColumnOptions & /*options*/) {
  // Reject in-memory segments (cannot rewrite on-disk blocks)
  if (memory_store_) {
    return Status::NotSupported(
        "Add column is not supported for segment with memory store");
  }

  // Convert collection schema to Arrow fields
  std::vector<std::shared_ptr<arrow::Field>> fields;
  ConvertCollectionSchemaToArrowFields(collection_schema_, &fields);
  auto physic_schema = std::make_shared<arrow::Schema>(fields);

  // Prepare Arrow field for new column
  std::shared_ptr<arrow::Field> arrow_field;
  ConvertFieldSchemaToArrowField(column_schema.get(), &arrow_field);

  // Build new column data
  std::shared_ptr<arrow::ChunkedArray> new_column;
  if (expression.empty()) {
    // Create null column
    arrow::Result<std::shared_ptr<arrow::Array>> result =
        arrow::MakeArrayOfNull(arrow_field->type(),
                               scalar_blocks[0].doc_count_);
    new_column = std::make_shared<arrow::ChunkedArray>(std::vector{
        result.ValueOrDie()});
  } else {
    // Evaluate expression over existing columns
    auto p_result = ParseToExpression(expression, physic_schema);
    auto dataset = ReadBlocksAsDataset(...);
    auto eval_result = EvaluateExpressionWithDataset(dataset,
                         column_schema->name(), expr, arrow_field->type());
    new_column = eval_result.ValueOrDie()->column(0);
  }

  // Write column into new scalar blocks
  WriteColumnInBlocks(column_schema->name(), new_column,
                      filter_column_blocks, path_, segment_meta_->id(),
                      [this]() { return allocate_block_id(); },
                      !options_.enable_mmap_, &new_blocks);

  // Rebuild inverted index if needed
  if (column_schema->has_invert_index()) {
    reopen_invert_indexer();
    invert_indexers_->create_column_indexer(*column_schema);
  }

  // Update segment metadata
  segment_meta_->add_column_meta(column_schema->name(),
                                 column_schema->has_invert_index(),
                                 /* other meta fields */);
  return Status::OK();
}
```

## The AlterColumn Operation

While `AddColumn` creates new fields, `AlterColumn` modifies existing ones through the `CollectionImpl::AlterColumn` method in `src/db/collection.cc`. This operation supports renaming columns or changing their definitions (type, index settings, etc.).

### Renaming and Redefining Fields

The `SegmentImpl::alter_column` implementation in `src/db/index/segment/segment.cc` (lines 216-226) handles both scenarios:

```cpp
Status SegmentImpl::alter_column(const std::string &column_name,
                                 const FieldSchema::Ptr &new_column_schema,
                                 const AlterColumnOptions & /*options*/) {
  // Validation omitted for brevity
  
  // 1. Replace Arrow field definition in the collection schema
  // 2. Rewrite every scalar block that contains the column
  //    (similar WriteColumnInBlocks flow as add_column)
  // 3. Re-build inverted index if needed
  // 4. Update segment meta information
  return Status::OK();
}
```

Unlike `AddColumn`, which generates new data, `AlterColumn` preserves existing data while rewriting blocks to reflect new metadata (such as field names or index configurations).
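
The rename case can be sketched as follows. This is a simplified model in which a scalar block is just a dictionary from column name to values; `rename_column_in_blocks` is a hypothetical helper, not part of ZVec. It shows the key property described above: the metadata (the column name) changes while the stored values are carried over unchanged.

```python
from typing import Dict, List

Block = Dict[str, list]  # one scalar block: column name -> values


def rename_column_in_blocks(blocks: List[Block], old: str, new: str) -> List[Block]:
    """Return blocks with `old` renamed to `new`; values are reused, not copied."""
    rewritten: List[Block] = []
    for block in blocks:
        if old not in block:
            # This block does not contain the column, so it needs no rewrite.
            rewritten.append(block)
            continue
        if new in block:
            raise ValueError(f"column {new!r} already exists")
        # Rebuild the block with the new name, keeping column order and data.
        rewritten.append({(new if name == old else name): values
                          for name, values in block.items()})
    return rewritten
```

The input blocks are never mutated, so a caller can discard the rewritten list if a later step fails.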

## Transactional Guarantees and Version Management

ZVec ensures **atomic schema evolution** through its `VersionManager`. After `SegmentManager` parallelizes the operation across all segments, `CollectionImpl` constructs a new version:

```cpp
Version new_version = version_manager_->get_current_version();
new_version.set_schema(*new_schema);
new_version.reset_writing_segment_meta(writing_segment_->meta());

for (auto meta : segment_manager_->get_segments_meta()) {
    new_version.update_persisted_segment_meta(meta);
}

version_manager_->apply(new_version);
version_manager_->flush();
```

This sequence ensures that:
- Readers always see a consistent schema version
- Partial failures leave the previous version intact
- The writing segment is recreated with the new schema before accepting new data
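
One common way to obtain the "previous version stays intact" property for an on-disk manifest is to write the new contents to a temporary file and then rename it over the old one. The sketch below shows that general technique only; it is an assumption for illustration, not a description of how `VersionManager` persists its manifest, and `commit_manifest` and `load_manifest` are hypothetical names.

```python
import json
import os
import tempfile


def commit_manifest(path: str, manifest: dict) -> None:
    """Write the manifest to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(manifest, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk first
        # On POSIX file systems the rename is atomic: readers see either the
        # old manifest or the new one, never a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure before the rename leaves the old manifest untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_manifest(path: str) -> dict:
    """Read the currently committed manifest."""
    with open(path) as f:
        return json.load(f)
```

If serialization fails partway through, the temporary file is removed and `load_manifest` still returns the previously committed version.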

The `SegmentManager` uses configurable concurrency (defaulting to hardware threads) to parallelize segment updates, controlled via `AddColumnOptions` defined in [`src/include/zvec/db/options.h`](https://github.com/alibaba/zvec/blob/main/src/include/zvec/db/options.h).
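
The fan-out pattern can be sketched with a thread pool. This is a minimal illustration, assuming segments are plain dictionaries and that `apply_to_segments` and its `concurrency` parameter are hypothetical names; it does not reflect the actual `SegmentManager` interface or the fields of `AddColumnOptions`.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def apply_to_segments(segments: List[dict],
                      change: Callable[[dict], dict],
                      concurrency: Optional[int] = None) -> List[dict]:
    """Apply `change` to every segment in parallel; raise if any segment fails."""
    # Fall back to the number of hardware threads when no limit is given.
    workers = concurrency or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order and re-raises a worker's exception
        # when its result is consumed.
        return list(pool.map(change, segments))
```

Because the function returns new segment metadata instead of publishing it, the caller can commit a new version only when every segment succeeded and discard the results otherwise.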

## Practical Code Examples

### Adding a Nullable Column

```python
import zvec

# Open existing collection
col = zvec.Collection.open("/tmp/my_collection")

# Define nullable int32 field
field = zvec.FieldSchema(
    name="age",
    data_type=zvec.DataType.INT32,
    nullable=True,
    invert_index=False
)

# Add null-filled column (no expression)
status = col.add_column(field, expression="")
assert status.ok()
```

### Adding a Computed Column

```python
# Reuses `col` from the previous example

# Define computed field
field = zvec.FieldSchema(
    name="score_sum",
    data_type=zvec.DataType.FLOAT,
    nullable=False,
    invert_index=False
)

# Expression referencing existing columns
expr = "int_score + float_score"
status = col.add_column(field, expression=expr)
assert status.ok()
```

### Renaming a Column

```python
# Rename "age" to "user_age"
status = col.alter_column(column_name="age", rename="user_age")
assert status.ok()
```

### Modifying Index Settings

```python
# Add inverted index to existing "title" column
new_schema = zvec.FieldSchema(
    name="title",
    data_type=zvec.DataType.STRING,
    nullable=False,
    invert_index=True  # Enable the inverted index
)

status = col.alter_column(column_name="title",
                          new_column_schema=new_schema)
assert status.ok()
```

## Summary

- **Immutable versioning** ensures ZVec never modifies schemas in place; every evolution creates a new schema version applied atomically via `VersionManager`.
- **AddColumn** supports both null-filled columns and computed columns generated through Arrow expression evaluation over existing data.
- **AlterColumn** handles renaming and field redefinition by rewriting scalar blocks while preserving existing data, optionally rebuilding inverted indexes.
- **Parallel execution** across segments uses configurable concurrency in `SegmentManager`, ensuring efficient schema evolution on large collections.
- **Transactional safety** guarantees that readers see consistent schemas and that a partial failure leaves the previous valid version in effect.

## Frequently Asked Questions

### How does ZVec ensure data consistency during schema evolution?

ZVec implements a **transactional versioning** system where schema changes are applied to a new `Version` object only after all segments successfully process the update. The `VersionManager` atomically switches to the new version and flushes the manifest to disk, ensuring readers always access a consistent schema state even if the operation fails midway.

### Can I add a column with calculated values from existing fields?

Yes. ZVec's `AddColumn` operation accepts a SQL-like **expression** parameter that references existing scalar columns. The system parses this expression using `ParseToExpression`, builds a temporary Arrow dataset from existing blocks via `ReadBlocksAsDataset`, and evaluates the expression over that dataset through `EvaluateExpressionWithDataset` to materialize the new column data.

### What happens to existing data when I alter a column's definition?

When using `AlterColumn` to rename a field or change its properties, ZVec **rewrites the scalar blocks** containing that column while preserving the actual data values. The operation updates the Arrow field definition, writes new blocks via `WriteColumnInBlocks`, and optionally rebuilds inverted indexes if the index configuration changed, ensuring existing documents remain accessible under the new schema.