# How ZVec's SQL Query Engine Works: From Filter Strings to Arrow Execution Plans

> Discover how ZVec's SQL query engine transforms filters into Arrow execution plans. Learn about its three-stage pipeline for efficient ANN search and optimized compute graphs.

- Repository: [Alibaba/zvec](https://github.com/alibaba/zvec)
- Tags: internals
- Published: 2026-02-16

---

**ZVec's SQL query engine compiles high-level vector search requests into Apache Arrow compute graphs through a three-stage pipeline: parsing filters into `SQLInfo` trees, analyzing and rewriting them into `QueryInfo` with rule-based index selection, and planning Arrow Acero execution nodes for ANN search.**

The alibaba/zvec repository implements a high-performance vector database that processes SQL-like queries using Apache Arrow's compute engine. Understanding how ZVec's SQL query engine works reveals how the system optimizes vector similarity search with complex predicate filtering.

## Parsing and Normalization: From Text to SQLInfo

The query processing begins in `src/db/sqlengine/parser/zvec_parser.cc`, where the **ZVecParser** class converts textual filter strings into structured representations.

When a user provides a filter like `"age > 30 AND title LIKE '%engineer%'"`, the parser invokes ANTLR-generated grammar rules defined in [`zvec_sql_parser.h`](https://github.com/alibaba/zvec/blob/main/zvec_sql_parser.h) to build an abstract syntax tree (AST). The `ZVecParser::parse_filter()` method validates syntax and normalizes literals by trimming quotes and converting numeric values.

The AST is then transformed into a **`SQLInfo`** object via `SQLInfoHelper::MessageToSQLInfo`. This object records the SQL type (`SELECT`, `INSERT`, etc.) and maintains a pointer to the top-level **base info** (typically a `SelectInfo` structure). This normalized form serves as the canonical input for the analysis phase.
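
To make the shape of this stage concrete, here is a minimal sketch of turning a filter string into a tree. It is not ZVec's ANTLR grammar: the `Predicate` and `Logic` classes, the token regex, and `parse_filter` are hypothetical stand-ins, and only `AND`/`OR`, parentheses, simple comparisons, and `LIKE` are handled.

```python
import re
from dataclasses import dataclass
from typing import Union

# Hypothetical node types; ZVec's real SQLInfo/SelectInfo structures are richer.
@dataclass
class Predicate:
    field: str
    op: str
    value: Union[int, float, str]

@dataclass
class Logic:
    op: str            # "AND" or "OR"
    left: "Node"
    right: "Node"

Node = Union[Predicate, Logic]

TOKEN = re.compile(r"\s*(>=|<=|!=|=|>|<|\(|\)|'[^']*'|[A-Za-z_]\w*|\d+(?:\.\d+)?)")

def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def normalize(tok):
    """Trim quotes from string literals and convert numeric literals."""
    if tok.startswith("'"):
        return tok[1:-1]
    return float(tok) if "." in tok else int(tok)

def parse_filter(text):
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of filter")
        pos += 1
        return tokens[pos - 1]

    def primary():
        if peek() == "(":
            take()
            node = or_expr()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        field, op, value = take(), take().upper(), take()
        return Predicate(field, op, normalize(value))

    def and_expr():
        node = primary()
        while peek() is not None and peek().upper() == "AND":
            take()
            node = Logic("AND", node, primary())
        return node

    def or_expr():
        node = and_expr()
        while peek() is not None and peek().upper() == "OR":
            take()
            node = Logic("OR", node, and_expr())
        return node

    tree = or_expr()
    if peek() is not None:
        raise ValueError(f"trailing token {peek()!r}")
    return tree

print(parse_filter("age > 30 AND title LIKE '%engineer%'"))
```

The recursive-descent layout gives `AND` higher precedence than `OR`, which is the conventional SQL behaviour; whether ZVec's grammar matches this in every detail is not confirmed here.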

## Analysis and Rewriting: Optimizing the Query Plan

The second stage occurs in `src/db/sqlengine/analyzer/query_analyzer.cc`, where the **QueryAnalyzer** class transforms `SQLInfo` into **`QueryInfo`**—a rich structure containing vector conditions, filter conditions, forward conditions, invert conditions, order-by clauses, and top-k parameters.

The `QueryAnalyzer::analyze()` method orchestrates several critical transformations:

1. **Vector condition extraction** via `check_and_convert_vector`: This verifies the vector field exists, extracts dense or sparse vector text, and populates `QueryInfo::QueryVectorCondInfo` with the query vector and search parameters.

2. **QueryNode tree construction** via `create_querynode_from_node`: This converts the `SelectInfo` tree into an executable `QueryNode` tree representing the logical query plan.

3. **Filter-vs-Invert decision** via `decide_filter_index_cond`: This rule-based optimizer inspects every predicate to determine execution strategy:
   - **Invertible predicates** (equality matches on indexed fields) become **invert-cond** candidates, leveraging inverted indexes for candidate narrowing.
   - **Forward-filter** predicates scan raw forward fields when no suitable index exists.
   - **Post-filter** predicates apply after vector search when `post_filter_topk` is configured.
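
The routing rule in step 3 can be sketched as follows. This is an illustrative approximation, not the logic of `decide_filter_index_cond`: the `Predicate` tuple, the `indexed_fields` set, and the rule that only `=` on an indexed field is invertible are assumptions drawn from the description above.

```python
from typing import NamedTuple

class Predicate(NamedTuple):   # hypothetical stand-in for a leaf QueryNode
    field: str
    op: str
    value: object

def classify(predicates, indexed_fields):
    """Split predicates into invert-cond candidates and forward filters."""
    invert, forward = [], []
    for p in predicates:
        # Assumed rule: equality on an indexed field can use the inverted index.
        if p.op == "=" and p.field in indexed_fields:
            invert.append(p)
        else:
            forward.append(p)
    return invert, forward

preds = [
    Predicate("category", "=", "books"),   # indexed equality -> invert
    Predicate("age", ">=", 30),            # range predicate -> forward
    Predicate("title", "=", "engineer"),   # equality, but no index -> forward
]
invert, forward = classify(preds, indexed_fields={"category"})
print([p.field for p in invert])    # ['category']
print([p.field for p in forward])   # ['age', 'title']
```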

The analyzer rejects unsupported constructs early, such as a vector clause nested anywhere beneath an `OR`, so that invalid queries fail during analysis rather than at execution time.
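
A check of this kind amounts to a tree walk that remembers whether an `OR` has been seen on the path from the root. The sketch below uses nested tuples as a hypothetical tree encoding; the function names and the error message are illustrative, not ZVec's.

```python
def has_vector_under_or(node, under_or=False):
    """Return True if any vector clause has an OR node among its ancestors."""
    kind = node[0]
    if kind == "vector":
        return under_or
    if kind in ("and", "or"):
        child_under_or = under_or or kind == "or"
        return any(has_vector_under_or(c, child_under_or) for c in node[1:])
    return False  # plain scalar predicate

def validate(node):
    """Reject the tree before any planning work is done."""
    if has_vector_under_or(node):
        raise ValueError("vector clause may not appear under OR")
    return node

ok = ("and", ("vector", "emb"), ("pred", "age", ">", 30))
bad = ("or", ("vector", "emb"), ("pred", "age", ">", 30))
nested_bad = ("or", ("pred", "a", "=", 1), ("and", ("vector", "emb"), ("pred", "b", "=", 2)))

validate(ok)  # passes silently
for tree in (bad, nested_bad):
    try:
        validate(tree)
    except ValueError as exc:
        print("rejected:", exc)
```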

## Planning and Execution: Building the Arrow Compute Pipeline

The final stage in `src/db/sqlengine/planner/query_planner.cc` converts `QueryInfo` into **`PlanInfo`**—a tree of Apache Arrow compute operators executable by Arrow Acero.

The `QueryPlanner::make_plan()` method constructs the execution graph through these steps:

1. **Scan node selection**: Based on `QueryInfo` flags, the planner chooses between:
   - **`VectorRecallNode`**: Executes approximate nearest-neighbor (ANN) algorithms like HNSW or IVF on the vector field.
   - **`InvertRecallNode`**: Uses inverted indexes to narrow candidate sets before vector search.
   - **`SegmentNode`**: Performs forward scans on raw segment data when indexes are unavailable.

2. **Expression compilation**: The planner builds Arrow **`cp::Expression`** objects from the filter tree via `create_filter_node`. These compile into Arrow kernels—such as `is_in`, `list_value_length`, and custom `contain_all/any` operations—and attach to scan nodes.

3. **Pipeline construction**: The final **execution graph** forms an Acero pipeline:
   ```
   SegmentNode → (optional) InvertRecallNode → VectorRecallNode → FilterOps → FetchVectorOp → Project
   ```

4. **Execution**: `PlanInfo::execute_to_reader()` launches the pipeline, returning an Arrow `RecordBatchReader` that streams results as `RecordBatch` objects.
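
The streaming shape of such a pipeline can be imitated in plain Python with generators. This is a conceptual sketch only: it does not use Arrow or Acero, brute-force Euclidean distance stands in for ANN recall, and the function names (`segment_scan`, `vector_recall`, `filter_op`, `project`) are hypothetical.

```python
import heapq
import math

def segment_scan(rows, batch_size=2):
    """Yield rows in fixed-size batches, like a scan node emitting record batches."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def vector_recall(batches, query, k):
    """Brute-force top-k by Euclidean distance (a stand-in for HNSW/IVF)."""
    scored = ((math.dist(r["emb"], query), r) for batch in batches for r in batch)
    top = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    yield [dict(r, score=d) for d, r in top]

def filter_op(batches, predicate):
    """Drop rows that fail the predicate; skip batches that become empty."""
    for batch in batches:
        kept = [r for r in batch if predicate(r)]
        if kept:
            yield kept

def project(batches, columns):
    """Keep only the requested columns."""
    for batch in batches:
        yield [{c: r[c] for c in columns} for r in batch]

rows = [
    {"doc_id": 1, "age": 25, "emb": [0.0, 0.0]},
    {"doc_id": 2, "age": 35, "emb": [1.0, 0.0]},
    {"doc_id": 3, "age": 40, "emb": [0.1, 0.0]},
    {"doc_id": 4, "age": 50, "emb": [3.0, 4.0]},
]

# Compose the stages in the same order as the diagram: scan, recall, filter, project.
plan = project(
    filter_op(
        vector_recall(segment_scan(rows), query=[0.0, 0.0], k=3),
        lambda r: r["age"] > 30,
    ),
    columns=["doc_id", "score"],
)
results = [row for batch in plan for row in batch]
print(results)
```

Nothing runs until `plan` is iterated, which loosely mirrors how a reader pulls batches from an execution plan. Because the filter in this sketch runs after recall, it can return fewer than `k` rows: here the nearest row (`doc_id` 1) is recalled and then discarded by the filter.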

## Result Materialization

After execution, `SQLEngineImpl::fill_result` iterates over the `RecordBatchReader`, allocating a `Doc` object for each row. Type-specific helpers like `fill_doc_vector<float>` and `fill_doc_field<arrow::Int64Array>` copy Arrow column data into ZVec's internal `Doc` representation.

The materialization process attaches **doc-id**, **score**, **user-id**, and any selected fields, ultimately returning a `DocPtrList` to the caller.
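
Conceptually, materialization is a column-to-row conversion. The sketch below uses plain dicts of lists in place of Arrow `RecordBatch` objects; its `Doc` dataclass and `fill_result` function are simplified, hypothetical analogues and do not reproduce ZVec's actual types or signatures.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:                       # simplified stand-in for ZVec's Doc
    doc_id: int
    score: float
    fields: dict = field(default_factory=dict)

RESERVED = {"doc_id", "score"}

def fill_result(batches):
    """Turn columnar batches (dicts of equal-length lists) into a list of Doc rows."""
    docs = []
    for batch in batches:
        extra = [name for name in batch if name not in RESERVED]
        for i in range(len(batch["doc_id"])):
            docs.append(Doc(
                doc_id=batch["doc_id"][i],
                score=batch["score"][i],
                fields={name: batch[name][i] for name in extra},
            ))
    return docs

batches = [
    {"doc_id": [3, 2], "score": [0.1, 1.0], "title": ["a", "b"]},
    {"doc_id": [7], "score": [2.5], "title": ["c"]},
]
docs = fill_result(batches)
for doc in docs:
    print(doc.doc_id, doc.score, doc.fields)
```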

## Code Examples

### Python: Simple Vector Search

```python
import zvec
from zvec import CollectionSchema, VectorSchema, DataType, HnswQueryParam

# 1. Initialise the library (once per process)
zvec.init(log_type=zvec.LogType.CONSOLE, log_level=zvec.LogLevel.INFO)

# 2. Define a collection schema with a 128-dimensional FP32 vector field
schema = CollectionSchema(
    name="my_collection",
    vectors=[VectorSchema(name="emb", dimension=128,
                         data_type=DataType.VECTOR_FP32)]
)

# 3. Open the collection (assumes it already exists)
coll = zvec.open("./my_collection", schema)

# 4. Build a VectorQuery (dense FP32 vector)
query = zvec.VectorQuery(
    field_name="emb",
    vector=[0.1] * 128,               # 128-dim vector
    param=HnswQueryParam(k=10)        # top-10 ANN
)

# 5. Execute the query
results = coll.query(query)            # returns List[Doc]

for doc in results:
    print(f"doc_id={doc.doc_id}, score={doc.score}")
```

Behind the scenes, `coll.query` invokes the C++ `SQLEngineImpl::execute`, which runs the three-stage pipeline described above.

### Python: SQL-like Filter with Vector Search

```python
# Same collection as before
filter_str = "age >= 30 AND title LIKE '%engineer%'"
query = zvec.VectorQuery(
    field_name="emb",
    vector=[0.2] * 128,
    filter=filter_str,                 # textual filter gets parsed by ZVecParser
    param=HnswQueryParam(k=5)
)

results = coll.query(query)

# The filter is transformed into Arrow expressions and applied
# before/after the ANN search depending on index availability.
```

### C++: Direct Engine Usage

```cpp
#include <iostream>
#include <memory>
#include <vector>

#include <zvec/db/sqlengine/sqlengine.h>
#include <zvec/db/doc.h>

using namespace zvec::sqlengine;

int main() {
    // 1. Create engine (profiler is optional)
    auto engine = SQLEngine::create(nullptr);

    // 2. Prepare collection schema & vector query
    CollectionSchema::Ptr coll = ...;          // obtained from DB metadata
    VectorQuery query;
    query.field_name_ = "emb";
    query.query_vector_ = std::make_shared<std::vector<float>>(128, 0.3f);
    query.topk_ = 10;
    query.filter_ = "category = 'books'";

    // 3. Load segments (each segment = a data file)
    std::vector<Segment::Ptr> segs = LoadSegments(...);

    // 4. Execute, and stop on failure before touching the result
    auto res = engine->execute(coll, query, segs);
    if (!res) {
        std::cerr << "query failed\n";
        return 1;
    }
    for (const auto &doc_ptr : res.value()) {
        std::cout << "doc_id=" << doc_ptr->doc_id()
                  << " score=" << doc_ptr->score() << "\n";
    }
    return 0;
}
```

## Key Source Files

| File | Purpose |
|------|---------|
| [`src/db/sqlengine/sqlengine.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/sqlengine.h) | Abstract `SQLEngine` interface |
| [`src/db/sqlengine/sqlengine_impl.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/sqlengine_impl.h) | Concrete implementation (`SQLEngineImpl`) |
| `src/db/sqlengine/sqlengine_impl.cc` | Core orchestration: parsing, planning, result materialization |
| `src/db/sqlengine/parser/zvec_parser.cc` | ANTLR-based filter parser → `SQLInfo` |
| `src/db/sqlengine/analyzer/query_analyzer.cc` | Transforms `SQLInfo` into `QueryInfo`, decides index/forward/post filters |
| [`src/db/sqlengine/planner/query_planner.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/planner/query_planner.h) & `.cc` | Generates Arrow execution plan (`PlanInfo`) |
| [`src/db/sqlengine/planner/vector_recall_node.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/planner/vector_recall_node.h) | Vector ANN recall node (HNSW, IVF, etc.) |
| [`src/db/sqlengine/planner/invert_recall_node.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/planner/invert_recall_node.h) | Inverted-index based candidate narrowing |
| [`src/db/sqlengine/planner/segment_node.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/planner/segment_node.h) | Reads a segment and produces Arrow record batches |
| `src/db/sqlengine/planner/ops/*` | Arrow compute operators for `IN`, `LIKE`, `CONTAIN` etc. |

These files together constitute the **SQL query engine** that turns a textual filter and a vector query into an optimized Arrow compute pipeline, enabling fast ANN search with optional predicate push-down.

## Summary

- **ZVec's SQL query engine** processes vector search requests through a three-stage pipeline: parsing, analysis, and planning/execution.
- The **ZVecParser** in `zvec_parser.cc` converts filter strings into `SQLInfo` trees using ANTLR-generated grammars.
- The **QueryAnalyzer** in `query_analyzer.cc` transforms `SQLInfo` into `QueryInfo`, extracting vector conditions and deciding between inverted-index filters, forward scans, and post-filters.
- The **QueryPlanner** in `query_planner.cc` builds a `PlanInfo` execution graph using Arrow Acero operators, combining `VectorRecallNode`, `InvertRecallNode`, and `SegmentNode` into a streaming pipeline.
- Results are materialized from Arrow `RecordBatchReader` into ZVec's `Doc` objects via type-specific helpers.

## Frequently Asked Questions

### How does ZVec parse SQL-like filter strings?

ZVec parses filter strings with an ANTLR-generated grammar, driven by the **ZVecParser** class in `src/db/sqlengine/parser/zvec_parser.cc`. The parser converts the raw text into an abstract syntax tree (AST), then normalizes literals and builds a `SQLInfo` object that represents the query structure. This process validates syntax and handles expressions such as `LIKE`, `IN`, and logical operators.

### What is the difference between invert-cond and forward-filter in ZVec?

**Invert-cond** predicates are those that can leverage ZVec's inverted indexes, typically equality matches on indexed fields that narrow the candidate set before vector search. **Forward-filter** predicates scan raw forward field data when no suitable index exists, reading values directly from segments. The **QueryAnalyzer** in `query_analyzer.cc` makes this decision via `decide_filter_index_cond`, routing predicates to the most efficient execution path based on index availability and predicate type.

### How does ZVec execute vector similarity search?

Vector similarity search executes through the **VectorRecallNode** defined in [`src/db/sqlengine/planner/vector_recall_node.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/planner/vector_recall_node.h). During the planning phase, the **QueryPlanner** instantiates this node with the query vector and ANN parameters (such as HNSW or IVF configurations). At execution time, Arrow Acero streams data through the node, which performs approximate nearest neighbor search against the indexed vector segments, returning top-k candidates with similarity scores.

### Can I use ZVec's SQL engine directly from C++?

Yes, the C++ API in [`src/db/sqlengine/sqlengine.h`](https://github.com/alibaba/zvec/blob/main/src/db/sqlengine/sqlengine.h) exposes the **SQLEngine** interface for direct integration. You can create an engine instance via `SQLEngine::create()`, prepare a `VectorQuery` with field names, query vectors, and filter strings, then call `engine->execute()` with your collection schema and segment list. This returns a `DocPtrList` containing document IDs, scores, and field values without requiring the Python wrapper.