# Implementing Full-Text Search with Tantivy in Turso: A Technical Deep Dive

> Learn to implement full-text search with Tantivy in Turso. Discover how Turso integrates Tantivy using a hybrid B-Tree directory for efficient data retrieval.

- Repository: [Turso Database/turso](https://github.com/tursodatabase/turso)
- Tags: deep-dive
- Published: 2026-06-22

---

**Turso implements native full-text search by embedding the Tantivy search library into its storage engine, using a hybrid B-Tree directory architecture that caches hot files in memory while streaming large segments from disk.**

Turso, the edge-compatible SQLite database, ships with a native full-text search (FTS) engine built on the Tantivy search library. The implementation has three tightly coupled layers that connect Tantivy's search primitives to Turso's B-Tree storage backend. It lives primarily in [`core/index_method/fts.rs`](https://github.com/tursodatabase/turso/blob/main/core/index_method/fts.rs) and exposes SQLite-style `CREATE INDEX ... USING fts` semantics while keeping Tantivy's indexing capabilities.

## Architecture Overview

Turso's FTS implementation consists of three core layers that work together to provide fast text indexing and querying.

### Tokenizer and Helper Functions

The foundation provides fast, reusable tokenization utilities that function even without an FTS index. The `fts_highlight` and `fts_match` functions (lines 81‑90 and 101‑110 of [`core/index_method/fts.rs`](https://github.com/tursodatabase/turso/blob/main/core/index_method/fts.rs)) offer stand-alone text processing capabilities.

A thread-local `TextAnalyzer` cached as `FTS_TOKENIZER` is reused for every call, avoiding the cost of creating a new analyzer each time (lines 66‑74). Reusing the analyzer keeps per-call setup overhead low during high-throughput operations.
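The caching pattern described above can be sketched with the standard library alone. This is a minimal illustration, not Turso's code: `SimpleAnalyzer`, `CACHED_ANALYZER`, and the split-and-lowercase rule are hypothetical stand-ins for Tantivy's `TextAnalyzer` and the real `FTS_TOKENIZER`.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for Tantivy's `TextAnalyzer`.
struct SimpleAnalyzer;

impl SimpleAnalyzer {
    fn new() -> Self {
        // Analyzer construction would happen here, once per thread.
        SimpleAnalyzer
    }

    /// Lowercases the input and splits it on non-alphanumeric characters.
    fn tokenize(&mut self, text: &str) -> Vec<String> {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(str::to_lowercase)
            .collect()
    }
}

thread_local! {
    // Built lazily on first use in each thread, then reused by every call.
    static CACHED_ANALYZER: RefCell<SimpleAnalyzer> = RefCell::new(SimpleAnalyzer::new());
}

/// Tokenizes `text` with the per-thread analyzer instead of building a new one.
fn tokenize(text: &str) -> Vec<String> {
    CACHED_ANALYZER.with(|analyzer| analyzer.borrow_mut().tokenize(text))
}

fn main() {
    println!("{:?}", tokenize("Hello, World!"));
    println!("{:?}", tokenize("full-text search"));
}
```

The `RefCell` is there because the sketch assumes tokenizing needs mutable access to the analyzer; each thread gets its own instance, so no locking is involved.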

### Hybrid B-Tree Directory

The `HybridBTreeDirectory` struct adapts Tantivy's `Directory` abstraction to Turso's B-Tree storage (lines 74‑106 and 171‑190). This implementation employs a two-tier caching strategy:

- **Hot-cache**: Small metadata files and term dictionaries remain fully resident in memory
- **Chunk-cache**: Large segment files are split into 1 MiB chunks (`DEFAULT_CHUNK_SIZE`) stored as separate B-Tree rows

When Tantivy requests file access, `get_file_handle` checks the hot-cache first, then pending writes, and finally lazily loads required chunks via `LazyFileHandle`. The `get_chunks_range_blocking` routine (lines 656‑704) seeks once and reads a contiguous range of chunks, which avoids a separate B-Tree seek for every chunk of a large index segment.
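To make the chunking arithmetic concrete, here is a small sketch of how a byte range maps onto fixed-size chunks. The function names `chunks_for` and `read_range` are hypothetical, and the in-memory `Vec<Vec<u8>>` merely stands in for chunk rows in the B-Tree; only the 1 MiB chunk size comes from the text above.

```rust
use std::ops::Range;

/// 1 MiB, the chunk size the article gives for `DEFAULT_CHUNK_SIZE`.
const CHUNK_SIZE: usize = 1024 * 1024;

/// Half-open range of chunk indices that cover the byte range `bytes`.
fn chunks_for(bytes: Range<usize>, chunk_size: usize) -> Range<usize> {
    if bytes.start >= bytes.end {
        return 0..0; // empty request touches no chunks
    }
    let first = bytes.start / chunk_size;
    let last = (bytes.end - 1) / chunk_size;
    first..last + 1
}

/// Assembles `bytes` from consecutive chunks (assumes the range is in bounds).
fn read_range(chunks: &[Vec<u8>], bytes: Range<usize>, chunk_size: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    for idx in chunks_for(bytes.clone(), chunk_size) {
        let chunk_start = idx * chunk_size;
        // Clamp the requested range to this chunk's boundaries.
        let lo = bytes.start.max(chunk_start) - chunk_start;
        let hi = bytes.end.min(chunk_start + chunks[idx].len()) - chunk_start;
        out.extend_from_slice(&chunks[idx][lo..hi]);
    }
    out
}

fn main() {
    // A read that straddles the first chunk boundary needs two chunks.
    println!("{:?}", chunks_for(1_048_000..1_049_000, CHUNK_SIZE));

    // Tiny chunk size so the example data stays readable.
    let chunks: Vec<Vec<u8>> = b"abcdefghij".chunks(4).map(|c| c.to_vec()).collect();
    let bytes = read_range(&chunks, 2..9, 4);
    println!("{}", String::from_utf8_lossy(&bytes));
}
```

Because the covering chunk indices are always contiguous, a single seek to the first chunk followed by sequential reads is enough to serve the request.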

### FTS Index Method and Attachment

The `FtsIndexMethod` implements Turso's `IndexMethod` trait (lines 1290‑1296), connecting the directory abstraction to Turso's index-method plumbing. When processing `CREATE INDEX ... USING fts`, the method creates a `FtsIndexAttachment` (lines 1318‑1350) that holds:

- The Tantivy `Schema` and `IndexReader`
- Field-to-column mappings
- Query patterns and the shared cached directory

The attachment parses `WITH` clause options, including `tokenizer` and `weights`, during the `attach` phase (lines 1352‑1410), validating tokenizer names against `SUPPORTED_TOKENIZERS` and configuring per-field boost factors.

## How Full-Text Search Works in Turso

Understanding the data flow reveals how Turso maintains consistency and performance across the search lifecycle.

### Tokenizer Caching

Every text operation uses the thread-local `FTS_TOKENIZER` instance, so repeated queries do not rebuild the analyzer. Because the cache is thread-local, it persists across SQL statements executed on the same thread.

### Directory I/O Flow

When Tantivy reads index files, `HybridBTreeDirectory::get_file_handle` performs a three-step lookup:

1. Check the in-memory hot-cache for small metadata files
2. Check the pending-writes hashmap for uncommitted changes
3. Load 1 MiB chunks from the B-Tree via `get_chunks_range_blocking`

This hybrid approach keeps memory usage bounded (under approximately 200 MiB by default) while ensuring critical term dictionaries remain resident.
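The lookup order can be modeled in a few lines. The sketch below is an assumption-laden toy: `HybridDir`, `Source`, and `sample_dir` are hypothetical names, and plain `HashMap`s stand in for the hot-cache, the pending-writes map, and the chunk rows in the B-Tree.

```rust
use std::collections::HashMap;

/// Which tier answered the read (for illustration only).
#[derive(Debug, PartialEq)]
enum Source {
    HotCache,
    Pending,
    Chunks,
}

/// Toy model of the directory; the real type is `HybridBTreeDirectory`.
struct HybridDir {
    hot_cache: HashMap<String, Vec<u8>>,
    pending: HashMap<String, Vec<u8>>,
    /// Stand-in for B-Tree rows: each file maps to its ordered chunks.
    chunks: HashMap<String, Vec<Vec<u8>>>,
}

impl HybridDir {
    /// Checks the tiers in the documented order and reports which one hit.
    fn read(&self, path: &str) -> Option<(Source, Vec<u8>)> {
        if let Some(bytes) = self.hot_cache.get(path) {
            return Some((Source::HotCache, bytes.clone()));
        }
        if let Some(bytes) = self.pending.get(path) {
            return Some((Source::Pending, bytes.clone()));
        }
        self.chunks
            .get(path)
            .map(|chunks| (Source::Chunks, chunks.concat()))
    }
}

fn sample_dir() -> HybridDir {
    let mut dir = HybridDir {
        hot_cache: HashMap::new(),
        pending: HashMap::new(),
        chunks: HashMap::new(),
    };
    dir.hot_cache.insert("meta.json".into(), b"{}".to_vec());
    dir.pending.insert("seg_new.idx".into(), b"new".to_vec());
    dir.chunks.insert(
        "seg_old.idx".into(),
        vec![b"chunk0".to_vec(), b"chunk1".to_vec()],
    );
    dir
}

fn main() {
    let dir = sample_dir();
    for path in ["meta.json", "seg_new.idx", "seg_old.idx", "missing"] {
        println!("{path}: {:?}", dir.read(path).map(|(source, _)| source));
    }
}
```

The toy copies bytes on every read for simplicity; it only demonstrates the order in which the tiers are consulted.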

### Write Path Consistency

When Tantivy closes a file during segment creation, `HybridWriter::terminate_ref` (lines 443‑461) updates the in-memory catalog and pushes data to the hot-cache. Pending writes are immediately visible to subsequent reads through the hashmap, ensuring newly created segments can be searched without waiting for an asynchronous B-Tree flush.
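A rough model of this write path, using only the standard library: bytes are buffered while the file is open and published to a shared pending map when the writer is terminated. `BufferingWriter`, `Pending`, and `terminate` are hypothetical names chosen for the sketch; the real logic lives in `HybridWriter::terminate_ref`.

```rust
use std::collections::HashMap;
use std::io::{self, Write};
use std::sync::{Arc, Mutex};

/// Shared map of files that are written but not yet flushed to storage.
type Pending = Arc<Mutex<HashMap<String, Vec<u8>>>>;

/// Buffers a file in memory until it is closed.
struct BufferingWriter {
    path: String,
    buf: Vec<u8>,
    pending: Pending,
}

impl BufferingWriter {
    fn new(path: &str, pending: Pending) -> Self {
        BufferingWriter { path: path.to_string(), buf: Vec::new(), pending }
    }

    /// Publishes the finished file so readers sharing the map can see it.
    fn terminate(self) -> io::Result<()> {
        let BufferingWriter { path, buf, pending } = self;
        pending
            .lock()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "poisoned lock"))?
            .insert(path, buf);
        Ok(())
    }
}

impl Write for BufferingWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        self.buf.extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let pending: Pending = Arc::default();
    let mut writer = BufferingWriter::new("seg_0.idx", pending.clone());
    writer.write_all(b"postings")?;
    writer.terminate()?;
    // A reader holding the same map sees the file without any flush to disk.
    println!("{}", pending.lock().unwrap().contains_key("seg_0.idx"));
    Ok(())
}
```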

### Index Creation

When executing `CREATE INDEX ... USING fts`, Turso:

1. Parses the `WITH` clause for tokenizer and weight options
2. Validates the tokenizer against `SUPPORTED_TOKENIZERS`
3. Creates a Tantivy `Schema` with a fast-rowid field and text fields using the selected tokenizer
4. Stores the field weights for query-time boosting
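The option-handling steps above can be sketched as follows. This is an illustrative parser, not the one in `fts.rs`: `EXAMPLE_TOKENIZERS` is a made-up list (the actual contents of `SUPPORTED_TOKENIZERS` are not shown here), and `validate_tokenizer` and `parse_weights` are hypothetical helper names.

```rust
use std::collections::HashMap;

/// Hypothetical allow-list; the real `SUPPORTED_TOKENIZERS` may differ.
const EXAMPLE_TOKENIZERS: &[&str] = &["default", "raw"];

/// Rejects tokenizer names that are not on the allow-list.
fn validate_tokenizer(name: &str) -> Result<(), String> {
    if EXAMPLE_TOKENIZERS.contains(&name) {
        Ok(())
    } else {
        Err(format!("unsupported tokenizer `{name}`"))
    }
}

/// Parses a spec such as `title=2.0,body=1.0` into per-field weights.
fn parse_weights(spec: &str) -> Result<HashMap<String, f32>, String> {
    let mut weights = HashMap::new();
    for pair in spec.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        let (field, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("expected `field=weight`, got `{pair}`"))?;
        let weight: f32 = value
            .trim()
            .parse()
            .map_err(|_| format!("invalid weight `{value}` for `{field}`"))?;
        weights.insert(field.trim().to_string(), weight);
    }
    Ok(weights)
}

fn main() {
    println!("{:?}", validate_tokenizer("default"));
    println!("{:?}", validate_tokenizer("klingon"));
    println!("{:?}", parse_weights("title=2.0,body=1.0"));
}
```

Validating at index-creation time means a typo in the `WITH` clause fails the `CREATE INDEX` statement rather than surfacing later at query time.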

### Query Execution

For `MATCH` queries, the `FtsIndexAttachment` creates a fresh `HybridBTreeDirectory` via `clone_with_fresh_pending` to isolate concurrent writers. It builds a `tantivy::query::QueryParser` for the index's text fields, compiles the user's expression into a Tantivy `Query`, and executes via `Searcher::search` using a `TopDocs` collector. The resulting `DocAddress` values are mapped back to Turso rowids.
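The final step, turning scored hits into rowids, can be illustrated without Tantivy. In this sketch `DocAddr` is a hypothetical stand-in for Tantivy's `DocAddress`, the `HashMap` stands in for the rowid fast field, and `top_k_rowids` is an invented helper that mimics what a top-K collector followed by a rowid lookup would produce.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for Tantivy's `DocAddress` (segment + doc id).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DocAddr {
    segment: u32,
    doc: u32,
}

/// Keeps the `k` best-scoring hits and maps each to its rowid.
/// Hits without a known rowid are skipped.
fn top_k_rowids(
    hits: &[(f32, DocAddr)],
    rowids: &HashMap<DocAddr, i64>,
    k: usize,
) -> Vec<(i64, f32)> {
    let mut sorted = hits.to_vec();
    // Highest score first; `total_cmp` gives a total order over f32.
    sorted.sort_by(|a, b| b.0.total_cmp(&a.0));
    sorted
        .into_iter()
        .take(k)
        .filter_map(|(score, addr)| rowids.get(&addr).map(|&rowid| (rowid, score)))
        .collect()
}

fn main() {
    let a = DocAddr { segment: 0, doc: 0 };
    let b = DocAddr { segment: 0, doc: 1 };
    let c = DocAddr { segment: 1, doc: 0 };
    let rowids = HashMap::from([(a, 10), (b, 20), (c, 30)]);
    let hits = [(0.5, a), (2.0, b), (1.25, c)];
    println!("{:?}", top_k_rowids(&hits, &rowids, 2));
}
```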

## Code Examples

### Highlighting Search Terms Without an Index

Use `fts_highlight` to wrap matching terms in HTML tags without creating an FTS index:

```rust
use turso_core::index_method::fts::fts_highlight;

let highlighted = fts_highlight(
    "The quick brown fox jumps over the lazy dog",
    "quick fox",
    "<b>", "</b>",
);
assert_eq!(
    highlighted,
    "The <b>quick</b> brown <b>fox</b> jumps over the lazy dog"
);
```

This function tokenizes both the query and text using the cached `FTS_TOKENIZER` (lines 81‑90).
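For intuition about what such a helper does internally, here is a simplified, std-only sketch of term highlighting. It is not the `fts_highlight` implementation: the `highlight` and `flush_word` functions are hypothetical, and the tokenization rule (lowercase, split on non-alphanumeric characters) is an assumption made for the example.

```rust
use std::collections::HashSet;

/// Wraps every word of `text` that matches a query term in `open`/`close`.
fn highlight(text: &str, query: &str, open: &str, close: &str) -> String {
    // Normalize the query into a set of lowercase terms.
    let terms: HashSet<String> = query
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .collect();

    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_alphanumeric() {
            word.push(ch);
        } else {
            flush_word(&mut out, &mut word, &terms, open, close);
            out.push(ch); // separators are copied through unchanged
        }
    }
    flush_word(&mut out, &mut word, &terms, open, close);
    out
}

/// Emits the buffered word, tagged if it matches, then clears the buffer.
fn flush_word(out: &mut String, word: &mut String, terms: &HashSet<String>, open: &str, close: &str) {
    if word.is_empty() {
        return;
    }
    if terms.contains(&word.to_lowercase()) {
        out.push_str(open);
        out.push_str(word.as_str());
        out.push_str(close);
    } else {
        out.push_str(word.as_str());
    }
    word.clear();
}

fn main() {
    let text = "The quick brown fox jumps over the lazy dog";
    println!("{}", highlight(text, "quick fox", "<b>", "</b>"));
}
```

The sketch preserves the original casing and punctuation of the text and only compares lowercase forms when deciding whether a word matches.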

### Testing Text Matches

The `fts_match` function provides boolean matching for simple use cases:

```rust
use turso_core::index_method::fts::fts_match;

assert!(fts_match("hello world", "world"));
assert!(!fts_match("hello world", "planet"));
```

The implementation returns `true` if any query token appears in the target string.
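The "any token" semantics can be shown with a short std-only sketch. `matches_any` and `tokens` are hypothetical names, and the lowercase-and-split tokenization is an assumption; the point is only that one shared token is enough and that comparison happens on whole tokens.

```rust
use std::collections::HashSet;

/// Lowercase alphanumeric tokens of `s`.
fn tokens(s: &str) -> impl Iterator<Item = String> + '_ {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
}

/// True if at least one query token also occurs in the text.
fn matches_any(text: &str, query: &str) -> bool {
    let text_tokens: HashSet<String> = tokens(text).collect();
    tokens(query).any(|t| text_tokens.contains(&t))
}

fn main() {
    println!("{}", matches_any("hello world", "planet WORLD"));
    println!("{}", matches_any("hello world", "wor"));
}
```

Under these assumptions a multi-word query behaves like an OR over its terms, and a partial word such as `wor` does not match `world`.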

### Creating an FTS Index

Define full-text indexes using standard SQL with Turso-specific extensions:

```sql
CREATE INDEX idx_article_body
ON articles
USING fts (title, body)
WITH (tokenizer='default', weights='title=2.0,body=1.0');
```

This triggers `FtsIndexMethod::attach`, which builds the Tantivy `Schema` and records the per-field weights so that `title` matches are boosted at query time (lines 1352‑1410).

### Executing MATCH Queries

Query indexed content using the `MATCH` operator:

```sql
SELECT rowid, title
FROM articles
WHERE body MATCH 'rust programming';
```

At runtime, Turso parses the `MATCH` predicate into a Tantivy query, executes it via `Searcher::search`, and converts the resulting `DocAddress` values to rowids.

### Accessing the Hybrid Directory

Advanced users can interact with the underlying storage layer:

```rust
use std::path::Path;
use tantivy::directory::Directory;

// Keep the read guard in a named binding so the borrowed directory
// does not outlive a temporary.
let guard = cursor.cached_directory_state.read();
let state = guard.as_ref().unwrap();
let dir: &dyn Directory = &*state.directory;

// `meta.json` is the file Tantivy uses for index metadata.
if dir.exists(Path::new("meta.json"))? {
    println!("Index metadata is present");
}
```

The `HybridBTreeDirectory` implements Tantivy's `Directory` trait, including `get_file_handle`, `exists`, and `open_write` (starting at line 690).

## Key Source Files

- **[`core/index_method/fts.rs`](https://github.com/tursodatabase/turso/blob/main/core/index_method/fts.rs)**: Core implementation including tokenizer helpers, `HybridBTreeDirectory`, `FtsIndexMethod`, and `FtsIndexAttachment`
- **[`core/index_method/mod.rs`](https://github.com/tursodatabase/turso/blob/main/core/index_method/mod.rs)**: Registers `FtsIndexMethod` under the name `"fts"` for parser resolution
- **[`parser/src/parser.rs`](https://github.com/tursodatabase/turso/blob/main/parser/src/parser.rs)**: Handles the `MATCH` operator and `TK_MATCH` token parsing
- **[`tests/integration/index_method/mod.rs`](https://github.com/tursodatabase/turso/blob/main/tests/integration/index_method/mod.rs)**: Integration tests for FTS creation, write-and-search cycles, and utility functions
- **`core/vector/operations/*`**: Vector similarity functions that combine with FTS for hybrid search

## Summary

- **Hybrid Architecture**: Turso implements Tantivy's `Directory` trait via `HybridBTreeDirectory`, splitting large files into 1 MiB chunks while keeping metadata in memory
- **Zero-Copy Reads**: The hot-cache and chunk-cache return `Arc<[u8]>` backed data, eliminating unnecessary copies when Tantivy reads postings
- **SQL Compatibility**: Full-text search uses SQLite-style syntax (`CREATE INDEX ... USING fts` and `MATCH`) with extended `WITH` clause options for tokenizers and weights
- **Thread-Local Optimization**: The `FTS_TOKENIZER` cache lets helper functions like `fts_highlight` and `fts_match` reuse one analyzer per thread instead of rebuilding it on every call
- **Consistency**: Pending writes are visible immediately through the in-memory hashmap, ensuring new segments are searchable before B-Tree flush

## Frequently Asked Questions

### How does Turso's full-text search differ from SQLite's FTS5?

Turso uses Tantivy rather than SQLite's built-in FTS5 module. This gives Turso a Rust-native search engine with Rust's memory-safety guarantees, while keeping a SQL interface through the `MATCH` operator and `CREATE INDEX ... USING fts` syntax. The `HybridBTreeDirectory` adapter allows Tantivy to run atop Turso's existing storage engine rather than requiring separate virtual tables.

### What tokenizers are supported in Turso's FTS implementation?

Turso validates tokenizer names against the `SUPPORTED_TOKENIZERS` constant during index creation (lines 1352‑1410). The default tokenizer uses Tantivy's standard text analyzer with case folding and unicode-aware tokenization. A different supported tokenizer can be selected via the `WITH (tokenizer='...')` clause when creating the index.

### Can I combine full-text search with vector similarity search in Turso?

Yes. The `core/vector/operations/*` modules provide vector similarity functions (such as `distance_cos`) that can be combined with FTS results. You can filter candidates using `MATCH` predicates and then rank by vector similarity, or use both scores in a hybrid ranking formula within the same SQL query.
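As a sketch of what a hybrid ranking formula might look like, the example below blends a text score with cosine distance. Everything here is an assumption made for illustration: the linear blend, the `alpha` parameter, the premise that the text score has been normalized to the range 0 to 1, and the helper names `cosine_distance` and `hybrid_score`. It is not a formula taken from Turso.

```rust
/// Cosine distance (1 - cosine similarity); `None` for mismatched,
/// empty, or zero-length vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Hypothetical blend: `alpha` weights the (normalized) text score,
/// the remainder weights vector similarity.
fn hybrid_score(text_score: f32, distance: f32, alpha: f32) -> f32 {
    alpha * text_score + (1.0 - alpha) * (1.0 - distance)
}

fn main() {
    let query = [1.0, 0.0];
    let doc = [0.6, 0.8];
    if let Some(distance) = cosine_distance(&query, &doc) {
        println!("distance = {distance:.3}");
        println!("hybrid   = {:.3}", hybrid_score(0.8, distance, 0.5));
    }
}
```

In SQL the same idea would be expressed as an `ORDER BY` over an arithmetic expression combining the two scores; the right weighting depends on the data and is best tuned empirically.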

### How does the hybrid directory manage memory usage?

The `HybridBTreeDirectory` implements a two-tier caching system that bounds memory usage to approximately 200 MiB by default. Small metadata files and term dictionaries remain in the hot-cache, while large segment files are streamed on-demand from 1 MiB chunks stored in the B-Tree. The `chunk_cache` uses an LRU eviction policy, and the `clone_with_fresh_pending` method ensures concurrent queries don't retain unbounded pending writes.
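A byte-budgeted LRU cache of the kind described can be sketched as follows. This is a deliberately simple model, not Turso's `chunk_cache`: `ChunkCache`, `ChunkKey`, and the tick-based recency tracking (with a linear scan on eviction) are choices made for brevity.

```rust
use std::collections::HashMap;

/// (file path, chunk number)
type ChunkKey = (String, u64);

/// Evicts least-recently-used chunks once `budget_bytes` is exceeded.
struct ChunkCache {
    budget_bytes: usize,
    used_bytes: usize,
    tick: u64,
    /// Value is (last-used tick, chunk bytes).
    entries: HashMap<ChunkKey, (u64, Vec<u8>)>,
}

impl ChunkCache {
    fn new(budget_bytes: usize) -> Self {
        ChunkCache { budget_bytes, used_bytes: 0, tick: 0, entries: HashMap::new() }
    }

    /// Returns the chunk and marks it as most recently used.
    fn get(&mut self, key: &ChunkKey) -> Option<&[u8]> {
        self.tick += 1;
        let entry = self.entries.get_mut(key)?;
        entry.0 = self.tick;
        Some(entry.1.as_slice())
    }

    fn insert(&mut self, key: ChunkKey, data: Vec<u8>) {
        self.tick += 1;
        if let Some((_, old)) = self.entries.remove(&key) {
            self.used_bytes -= old.len();
        }
        self.used_bytes += data.len();
        self.entries.insert(key, (self.tick, data));
        // Evict the oldest entries until the cache fits its budget again.
        while self.used_bytes > self.budget_bytes {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, (used, _))| *used)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    if let Some((_, evicted)) = self.entries.remove(&k) {
                        self.used_bytes -= evicted.len();
                    }
                }
                None => break,
            }
        }
    }
}

fn main() {
    let mut cache = ChunkCache::new(8);
    cache.insert(("seg".into(), 0), vec![0; 4]);
    cache.insert(("seg".into(), 1), vec![1; 4]);
    cache.get(&("seg".into(), 0)); // chunk 0 is now the most recent
    cache.insert(("seg".into(), 2), vec![2; 4]); // evicts chunk 1
    println!("chunk 1 cached: {}", cache.get(&("seg".into(), 1)).is_some());
}
```

A production cache would typically use a linked structure for constant-time eviction; the scan here keeps the example short.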