# How zvec's Column Merging Reader Unifies Multi-Column Queries Across Segments

> Learn how zvecs Column Merging Reader unifies multi column queries by horizontally merging per segment column streams into a single logical view for efficient data access.

- Repository: [Alibaba/zvec](https://github.com/alibaba/zvec)
- Tags: internals
- Published: 2026-02-16

---

**zvec's `ColumnMergingReader` is an Arrow-compatible `RecordBatchReader` that horizontally merges per-segment column streams into a single logical view, enabling SQL queries to read columns stored across different segment files as one unified record batch.**

When executing SQL queries that reference columns stored in different segment files, the Alibaba zvec database engine must present a unified row-aligned view to the query planner. The `ColumnMergingReader` class handles this by implementing the Arrow `RecordBatchReader` interface to merge multiple per-segment readers into a single stream matching the query's target schema.

## What Is the Column Merging Reader in zvec?

The `ColumnMergingReader` is a specialized reader implementation located in `src/db/index/segment/column_merging_reader.cc`. It acts as a consolidation layer that sits between the storage engine and the SQL execution engine. When a query requests columns that reside in separate physical segments—such as when columns are partitioned or stored in different files—this reader aggregates the disparate streams into one coherent `RecordBatch`.

Unlike simple concatenation, the merging reader performs a **horizontal join** of columns at the row level. It ensures that row N in column A aligns with row N in column B, even when these columns originate from different segment readers. This alignment is critical for maintaining ACID properties and correct query results across segmented storage.

## How ColumnMergingReader Handles Multi-Column Queries

The implementation follows a systematic eight-step process to transform multiple segment-specific readers into a unified output stream.

### Creating the Reader Instance

The factory method `ColumnMergingReader::Make` initializes the merging reader with the query's target schema and a vector of input readers. According to the source in `column_merging_reader.cc` lines 25-31, this method constructs a shared pointer instance after validating the inputs:

```cpp
auto merged_reader = zvec::ColumnMergingReader::Make(target_schema,
                                                    std::move(per_segment_readers));

```

### Reading and Validating Batches

The core logic resides in `ReadNext`, which implements the Arrow reader interface. As shown in lines 53-71 of `column_merging_reader.cc`, the method iterates through all input readers, calling `ReadNext` on each to populate the `current_batches_` buffer:

```cpp
for (size_t i = 0; i < input_readers_.size(); ++i) {
    std::shared_ptr<arrow::RecordBatch> batch;
    ARROW_RETURN_NOT_OK(input_readers_[i]->ReadNext(&batch));
    current_batches_[i] = std::move(batch);
}

```

EOF detection occurs when all readers return `nullptr`, setting `has_more_ = false`.

Row alignment validation happens in lines 76-86. The code verifies that all non-null batches share the same row count. If batch 0 has 1000 rows but batch 1 has 999, the reader returns an `Invalid` Arrow status, preventing corrupted query results.

### Extracting and Assembling Columns

For each field in the target schema, the reader locates the corresponding column across the current batches. Lines 98-116 implement a name-based lookup using `GetFieldIndex`:

```cpp
for (int field_idx = 0; field_idx < target_schema_->num_fields(); ++field_idx) {
    const std::string& name = target_schema_->field(field_idx)->name();
    // Scan batches to find column by name
    for (const auto& batch : current_batches_) {
        if (batch) {
            int idx = batch->schema()->GetFieldIndex(name);
            if (idx != -1) {
                columns.push_back(batch->column(idx));
                break;
            }
        }
    }
}

```

Finally, lines 120-127 assemble the extracted arrays into a new `RecordBatch` using the target schema:

```cpp
*out = arrow::RecordBatch::Make(target_schema_, expected_rows, columns);

```

The temporary batch buffer is cleared in lines 128-130, preparing for the next iteration.

## Integration with the Query Engine

The `ColumnMergingReader` is instantiated by the segment-level planner when queries span multiple column files. In `src/db/index/segment/segment.cc` at lines 2702-2705, the code constructs the merging reader after collecting per-segment readers:

```cpp
auto merged_reader = ColumnMergingReader::Make(target_schema,
                                              std::move(readers));

```

This integration point demonstrates how zvec abstracts storage fragmentation from the SQL execution layer. The query engine receives a standard Arrow `RecordBatchReader` interface, unaware that underlying data resides in separate segment files.

## Summary

- **ColumnMergingReader** provides a unified Arrow `RecordBatchReader` interface for multi-column queries spanning different segment files in zvec.
- The reader performs **horizontal merging** at the row level, ensuring column alignment across segments through strict row-count validation in `ReadNext`.
- Implementation resides in `src/db/index/segment/column_merging_reader.cc`, with the factory method `Make` handling initialization and `ReadNext` managing the merge logic.
- The class integrates with the query engine via `segment.cc` (lines 2702-2705), presenting a transparent view of fragmented storage to the SQL execution layer.

## Frequently Asked Questions

### What is the primary purpose of ColumnMergingReader in zvec?

The primary purpose is to present a single logical view of data that is physically distributed across multiple segment files. When a SQL query requests columns stored in different segments, this reader merges the per-segment Arrow streams horizontally, ensuring row alignment without requiring the query engine to handle storage fragmentation details.

### How does ColumnMergingReader ensure data consistency across segments?

The reader enforces consistency through strict validation in the `ReadNext` method. After reading batches from all input readers, it verifies that every non-null batch contains the identical number of rows. If row counts differ, it returns an `Invalid` Arrow status, preventing the query engine from processing misaligned data that would produce incorrect results.

### What Arrow interface does ColumnMergingReader implement?

`ColumnMergingReader` implements the `arrow::RecordBatchReader` interface, which is the standard Arrow streaming API for reading record batches. This allows seamless integration with the zvec SQL execution engine and any other Arrow-compatible processing tools, as the merging reader appears as a standard stream of record batches matching the query's target schema.

### Where is ColumnMergingReader instantiated in the zvec codebase?

The reader is instantiated in `src/db/index/segment/segment.cc` at lines 2702-2705, within the segment-level query planning logic. This occurs when the planner detects that a query requires columns from multiple segment files, at which point it collects the individual segment readers and passes them to `ColumnMergingReader::Make` to create the unified reading interface.