Implementing Full-Text Search with Tantivy in Turso: A Technical Deep Dive

June 22, 2026 · tursodatabase/turso

Turso implements native full-text search by embedding the Tantivy search library into its storage engine, using a hybrid B-Tree directory architecture that caches hot files in memory while streaming large segments from disk.

Turso, the edge-compatible SQLite database, ships with a native full-text search (FTS) engine built on the Tantivy search library. The implementation has three tightly coupled layers that connect Tantivy's search primitives to Turso's B-Tree storage backend. It lives primarily in core/index_method/fts.rs and exposes SQLite-style CREATE INDEX ... USING fts semantics on top of Tantivy's indexing.

Architecture Overview

Turso's FTS implementation consists of three core layers that work together to provide fast text indexing and querying.

Tokenizer and Helper Functions

The foundation provides fast, reusable tokenization utilities that function even without an FTS index. The fts_highlight and fts_match functions (lines 81‑90 and 101‑110 of core/index_method/fts.rs) offer stand-alone text processing capabilities.

A thread-local TextAnalyzer cached as FTS_TOKENIZER is reused for every call, avoiding the cost of creating a new analyzer each time (lines 66‑74). Reusing the analyzer keeps per-call setup cost out of the hot path during high-throughput operations.
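The caching pattern can be sketched with the standard library alone. In this minimal sketch, SimpleAnalyzer is a hypothetical stand-in for Tantivy's TextAnalyzer, and CACHED_ANALYZER and BUILD_COUNT are illustrative names rather than Turso's actual statics; the counter exists only to make the reuse observable.

```rust
use std::cell::{Cell, RefCell};

/// Hypothetical stand-in for Tantivy's `TextAnalyzer`: lowercases text and
/// splits it on non-alphanumeric characters.
struct SimpleAnalyzer;

impl SimpleAnalyzer {
    fn new() -> Self {
        // Count constructions so the test can observe that the analyzer is reused.
        BUILD_COUNT.with(|c| c.set(c.get() + 1));
        SimpleAnalyzer
    }

    fn tokenize(&mut self, text: &str) -> Vec<String> {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(|t| t.to_lowercase())
            .collect()
    }
}

thread_local! {
    static BUILD_COUNT: Cell<usize> = Cell::new(0);
    // Built lazily on first use, then shared by every call on this thread.
    static CACHED_ANALYZER: RefCell<SimpleAnalyzer> = RefCell::new(SimpleAnalyzer::new());
}

/// Tokenize with the per-thread analyzer instead of constructing a new one.
fn tokenize(text: &str) -> Vec<String> {
    CACHED_ANALYZER.with(|a| a.borrow_mut().tokenize(text))
}

/// Number of analyzers constructed on the current thread.
fn analyzer_builds() -> usize {
    BUILD_COUNT.with(|c| c.get())
}
```

However many times tokenize is called on a thread, the analyzer is constructed once there; a second thread would build its own copy.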

Hybrid B-Tree Directory

The HybridBTreeDirectory struct adapts Tantivy's Directory abstraction to Turso's B-Tree storage (lines 74‑106 and 171‑190). This implementation employs a two-tier caching strategy:

Hot-cache: Small metadata files and term dictionaries remain fully resident in memory
Chunk-cache: Large segment files are split into 1 MiB chunks (DEFAULT_CHUNK_SIZE) stored as separate B-Tree rows

When Tantivy requests file access, get_file_handle checks the hot-cache first, then pending writes, and finally lazily loads the required chunks via LazyFileHandle. The get_chunks_range_blocking routine (lines 656‑704) seeks once and reads a contiguous range of chunks, so a read that spans several chunks costs one B-Tree seek rather than one per chunk.
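To make the chunked layout concrete, here is a rough sketch of splitting a file into fixed-size chunks and serving a byte range with one ordered range scan. It assumes rows keyed by (file id, chunk number); a BTreeMap stands in for Turso's B-Tree, and ChunkStore, write_file, and read_range are hypothetical names, not the functions in fts.rs.

```rust
use std::collections::BTreeMap;

/// 1 MiB, the chunk size the article attributes to DEFAULT_CHUNK_SIZE.
const CHUNK_SIZE: usize = 1024 * 1024;

/// Hypothetical stand-in for the B-Tree: rows keyed by (file id, chunk number).
type ChunkStore = BTreeMap<(u32, u64), Vec<u8>>;

/// Split `data` into fixed-size chunks and store one row per chunk.
fn write_file(store: &mut ChunkStore, file_id: u32, data: &[u8], chunk_size: usize) {
    for (chunk_no, chunk) in data.chunks(chunk_size).enumerate() {
        store.insert((file_id, chunk_no as u64), chunk.to_vec());
    }
}

/// Read bytes `start..end` of a file using a single ordered range scan,
/// which models "seek once, then read contiguous chunks".
fn read_range(
    store: &ChunkStore,
    file_id: u32,
    start: usize,
    end: usize,
    chunk_size: usize,
) -> Vec<u8> {
    if start >= end {
        return Vec::new();
    }
    let first = (start / chunk_size) as u64;
    let last = ((end - 1) / chunk_size) as u64;
    let mut out = Vec::with_capacity(end - start);
    for (&(_, chunk_no), chunk) in store.range((file_id, first)..=(file_id, last)) {
        let chunk_start = chunk_no as usize * chunk_size;
        // Clamp the requested range to this chunk's bounds.
        let lo = start.saturating_sub(chunk_start);
        let hi = (end - chunk_start).min(chunk.len());
        if lo < hi {
            out.extend_from_slice(&chunk[lo..hi]);
        }
    }
    out
}
```

Production callers would pass CHUNK_SIZE; the chunk size is a parameter here so the boundary arithmetic can be exercised with small inputs.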

FTS Index Method and Attachment

The FtsIndexMethod implements Turso's IndexMethod trait (lines 1290‑1296), connecting the directory abstraction to Turso's index-method plumbing. When processing CREATE INDEX ... USING fts, the method creates an FtsIndexAttachment (lines 1318‑1350) that holds:

The Tantivy Schema and IndexReader
Field-to-column mappings
Query patterns and the shared cached directory

The attachment parses WITH clause options, including tokenizer and weights, during the attach phase (lines 1352‑1410). It validates the tokenizer name against SUPPORTED_TOKENIZERS and configures per-field boost factors.
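The option handling might look roughly like the sketch below. The tokenizer list is an assumption made for illustration (the real contents of SUPPORTED_TOKENIZERS may differ), and FtsOptions and parse_fts_options are hypothetical names. The weights format follows the 'title=2.0,body=1.0' style used in this article's SQL example.

```rust
use std::collections::HashMap;

/// Assumed list for illustration only; the real SUPPORTED_TOKENIZERS may differ.
const SUPPORTED_TOKENIZERS: &[&str] = &["default", "raw", "whitespace"];

#[derive(Debug, PartialEq)]
struct FtsOptions {
    tokenizer: String,
    weights: HashMap<String, f32>,
}

/// Validate the tokenizer name and parse a weights string such as
/// "title=2.0,body=1.0" against the indexed column names.
fn parse_fts_options(
    tokenizer: Option<&str>,
    weights: Option<&str>,
    columns: &[&str],
) -> Result<FtsOptions, String> {
    let tokenizer = tokenizer.unwrap_or("default");
    if !SUPPORTED_TOKENIZERS.contains(&tokenizer) {
        return Err(format!("unsupported tokenizer: {}", tokenizer));
    }

    // Every indexed column starts with a neutral boost of 1.0.
    let mut parsed: HashMap<String, f32> =
        columns.iter().map(|c| (c.to_string(), 1.0)).collect();

    for pair in weights
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|p| !p.is_empty())
    {
        let (name, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("malformed weight: {}", pair))?;
        let name = name.trim();
        let boost: f32 = value
            .trim()
            .parse()
            .map_err(|_| format!("invalid weight for {}", name))?;
        match parsed.get_mut(name) {
            Some(slot) => *slot = boost,
            None => return Err(format!("unknown column in weights: {}", name)),
        }
    }

    Ok(FtsOptions { tokenizer: tokenizer.to_string(), weights: parsed })
}
```

Rejecting unknown tokenizers and unknown column names at attach time means a typo surfaces when the index is created rather than at query time.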

How Full-Text Search Works in Turso

Understanding the data flow reveals how Turso maintains consistency and performance across the search lifecycle.

Tokenizer Caching

Every text operation uses the thread-local FTS_TOKENIZER instance, so repeated queries do not rebuild the analyzer. Because the cache is thread-local, it persists across SQL statements that run on the same thread.

Directory I/O Flow

When Tantivy reads index files, HybridBTreeDirectory::get_file_handle implements a three-tier lookup:

Check the in-memory hot-cache for small metadata files
Check the pending-writes hashmap for uncommitted changes
Load 1 MiB chunks from the B-Tree via get_chunks_range_blocking
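The lookup order above can be sketched as follows. This is a simplified model, not the real HybridBTreeDirectory: all three tiers are in-memory maps, the B-Tree tier concatenates its chunks eagerly instead of loading them lazily, and SketchDirectory, Source, and get_file are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Which tier served a read.
#[derive(Debug, PartialEq)]
enum Source {
    HotCache,
    PendingWrite,
    BTree,
}

/// Hypothetical, simplified directory with the three tiers as plain maps.
struct SketchDirectory {
    hot_cache: HashMap<String, Arc<[u8]>>,
    pending_writes: HashMap<String, Arc<[u8]>>,
    /// path -> chunks, standing in for chunk rows stored in the B-Tree.
    btree: HashMap<String, Vec<Vec<u8>>>,
}

impl SketchDirectory {
    /// Resolve a path by checking the tiers in order: hot cache, pending
    /// writes, then the B-Tree chunks.
    fn get_file(&self, path: &str) -> Option<(Source, Arc<[u8]>)> {
        if let Some(bytes) = self.hot_cache.get(path) {
            return Some((Source::HotCache, Arc::clone(bytes)));
        }
        if let Some(bytes) = self.pending_writes.get(path) {
            return Some((Source::PendingWrite, Arc::clone(bytes)));
        }
        // The real implementation loads chunks lazily; this sketch
        // concatenates them eagerly to keep the example short.
        let chunks = self.btree.get(path)?;
        let bytes: Vec<u8> = chunks.concat();
        Some((Source::BTree, Arc::from(bytes)))
    }
}
```

Returning Arc<[u8]> from the two in-memory tiers means a hit hands out a shared reference to the cached bytes rather than a copy.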

This hybrid approach keeps memory usage bounded (under approximately 200 MiB by default) while ensuring critical term dictionaries remain resident.

Write Path Consistency

When Tantivy closes a file during segment creation, HybridWriter::terminate_ref (lines 443‑461) updates the in-memory catalog and pushes data to the hot-cache. Pending writes are immediately visible to subsequent reads through the hashmap, ensuring newly created segments can be searched without waiting for an asynchronous B-Tree flush.
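A minimal sketch of this read-your-writes behavior follows, assuming a shared map of pending files. SketchWriter, read_pending, and flush are hypothetical names, and a Vec stands in for the durable B-Tree; the real terminate_ref also updates the catalog and hot-cache, which this sketch omits.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Shared map of files that have been closed but not yet flushed.
type Pending = Arc<Mutex<HashMap<String, Arc<[u8]>>>>;

/// Hypothetical writer: buffers bytes in memory until the file is closed.
struct SketchWriter {
    path: String,
    buffer: Vec<u8>,
    pending: Pending,
}

impl SketchWriter {
    fn new(path: &str, pending: Pending) -> Self {
        SketchWriter { path: path.to_string(), buffer: Vec::new(), pending }
    }

    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
    }

    /// Closing the file publishes it to the shared pending map right away.
    fn terminate(self) {
        let bytes: Arc<[u8]> = Arc::from(self.buffer);
        self.pending.lock().unwrap().insert(self.path, bytes);
    }
}

/// Readers consult the pending map, so a closed file is readable before any flush.
fn read_pending(pending: &Pending, path: &str) -> Option<Arc<[u8]>> {
    pending.lock().unwrap().get(path).cloned()
}

/// A later flush drains pending files into durable storage
/// (a Vec stands in for the B-Tree here).
fn flush(pending: &Pending, durable: &mut Vec<(String, Arc<[u8]>)>) {
    durable.extend(pending.lock().unwrap().drain());
}
```

The point of the design is ordering: visibility happens at terminate, while durability happens at flush, so searches never wait on the slower step.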

Index Creation

When executing CREATE INDEX … USING fts, Turso:

Parses the WITH clause for tokenizer and weight options
Validates the tokenizer against SUPPORTED_TOKENIZERS
Creates a Tantivy Schema with a rowid fast field and text fields using the selected tokenizer
Stores the field weights for query-time boosting

Query Execution

For MATCH queries, the FtsIndexAttachment creates a fresh HybridBTreeDirectory via clone_with_fresh_pending to isolate concurrent writers. It builds a tantivy::query::QueryParser for the index's text fields, compiles the user's expression into a Tantivy Query, and executes via Searcher::search using a TopDocs collector. The resulting DocAddress values are mapped back to Turso rowids.
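The final steps of that pipeline (apply per-field boosts, keep the best results, map documents back to rowids) can be illustrated without Tantivy. In this sketch the Hit struct, the doc-to-rowid map, and top_rowids are hypothetical; in the real path, scoring and collection are handled by Tantivy's Searcher and TopDocs collector.

```rust
use std::collections::HashMap;

/// Hypothetical per-document hit: which field matched and how strongly.
struct Hit {
    doc_id: u32,
    field: &'static str,
    score: f32,
}

/// Combine per-field scores using the configured boosts, keep the best
/// `limit` documents, and translate document ids to table rowids.
fn top_rowids(
    hits: &[Hit],
    weights: &HashMap<&str, f32>,
    doc_to_rowid: &HashMap<u32, i64>,
    limit: usize,
) -> Vec<(i64, f32)> {
    let mut totals: HashMap<u32, f32> = HashMap::new();
    for hit in hits {
        // Fields without a configured weight get a neutral boost.
        let boost = weights.get(hit.field).copied().unwrap_or(1.0);
        *totals.entry(hit.doc_id).or_insert(0.0) += boost * hit.score;
    }

    let mut ranked: Vec<(i64, f32)> = totals
        .into_iter()
        .filter_map(|(doc, score)| doc_to_rowid.get(&doc).map(|&rowid| (rowid, score)))
        .collect();

    // Highest score first; ties broken by rowid for a deterministic order.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked.truncate(limit);
    ranked
}
```

With title weighted 2.0, a title hit of score 1.0 contributes as much as a body hit of score 2.0, which is the effect the weights option is meant to produce.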

Code Examples

Highlighting Search Terms Without an Index

Use fts_highlight to wrap matching terms in HTML tags without creating an FTS index:

use turso_core::index_method::fts::fts_highlight;

let highlighted = fts_highlight(
    "The quick brown fox jumps over the lazy dog",
    "quick fox",
    "<b>", "</b>",
);
assert_eq!(
    highlighted,
    "The <b>quick</b> brown <b>fox</b> jumps over the lazy dog"
);

This function tokenizes both the query and text using the cached FTS_TOKENIZER (lines 81‑90).
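For intuition, the highlighting logic can be approximated with the standard library. This is a simplified sketch under the assumption that tokens are runs of alphanumeric characters compared case-insensitively; highlight_sketch is a hypothetical name and does not reproduce Tantivy's analyzer.

```rust
/// Wrap every word of `text` whose lowercase form matches a query token.
/// Punctuation and spacing in `text` are preserved as-is.
fn highlight_sketch(text: &str, query: &str, open: &str, close: &str) -> String {
    let terms: Vec<String> = query
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect();

    let mut out = String::with_capacity(text.len());
    let mut word_start: Option<usize> = None;

    // A trailing sentinel separator makes sure the final word is flushed.
    let sentinel = std::iter::once((text.len(), ' '));
    for (i, c) in text.char_indices().chain(sentinel) {
        if c.is_alphanumeric() {
            // Remember where the current word began.
            word_start.get_or_insert(i);
            continue;
        }
        if let Some(start) = word_start.take() {
            let word = &text[start..i];
            if terms.contains(&word.to_lowercase()) {
                out.push_str(open);
                out.push_str(word);
                out.push_str(close);
            } else {
                out.push_str(word);
            }
        }
        // Copy the separator itself, except for the sentinel.
        if i < text.len() {
            out.push(c);
        }
    }
    out
}
```

Matching on lowercase forms while copying the original slice keeps the source text's capitalization inside the tags.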

Testing Text Matches

The fts_match function provides boolean matching for simple use cases:

use turso_core::index_method::fts::fts_match;

assert!(fts_match("hello world", "world"));
assert!(!fts_match("hello world", "planet"));

The implementation returns true if any query token appears in the target string (lines 55‑60).
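The "any token" rule is easy to state in code. The sketch below assumes the same simplified tokenization as above (lowercased alphanumeric runs); tokens and matches_any are hypothetical names, not the helpers in fts.rs.

```rust
use std::collections::HashSet;

/// Lowercased alphanumeric runs of `text`, as a set.
fn tokens(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// True if at least one query token also occurs in `text`.
fn matches_any(text: &str, query: &str) -> bool {
    let text_tokens = tokens(text);
    tokens(query).iter().any(|t| text_tokens.contains(t))
}
```

Under this rule a multi-word query behaves like an OR of its terms, and an empty query matches nothing.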

Creating an FTS Index

Define full-text indexes using standard SQL with Turso-specific extensions:

CREATE INDEX idx_article_body
ON articles
USING fts (title, body)
WITH (tokenizer='default', weights='title=2.0,body=1.0');

This triggers FtsIndexMethod::attach, which builds the Tantivy Schema and records the title weight of 2.0 for query-time boosting (lines 1352‑1410).

Executing MATCH Queries

Query indexed content using the MATCH operator:

SELECT rowid, title
FROM articles
WHERE body MATCH 'rust programming';

At runtime, Turso parses the MATCH predicate into a Tantivy query, executes it via Searcher::search, and converts the resulting DocAddress values to rowids.

Accessing the Hybrid Directory

Advanced users can interact with the underlying storage layer:

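use std::path::Path;
use turso_core::index_method::fts::HybridBTreeDirectory;
use tantivy::directory::Directory;

// `cursor` is assumed to be an open FTS cursor already in scope. Hold the
// read guard in a local so the borrow outlives the statement.
let state = cursor.cached_directory_state.read();
let dir: &dyn Directory = &*state.as_ref().unwrap().directory;

// `exists` is part of Tantivy's Directory trait; meta.json is Tantivy's
// index metadata file.
if dir.exists(Path::new("meta.json"))? {
    println!("index metadata found");
}

The HybridBTreeDirectory implements Tantivy's Directory trait, including get_file_handle, exists, and open_write (starting at line 690).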

Key Source Files

core/index_method/fts.rs: Core implementation including tokenizer helpers, HybridBTreeDirectory, FtsIndexMethod, and FtsIndexAttachment
core/index_method/mod.rs: Registers FtsIndexMethod under the name "fts" for parser resolution
parser/src/parser.rs: Handles the MATCH operator and TK_MATCH token parsing
tests/integration/index_method/mod.rs: Integration tests for FTS creation, write-and-search cycles, and utility functions
core/vector/operations/*: Vector similarity functions that combine with FTS for hybrid search

Summary

Hybrid Architecture: Turso implements Tantivy's Directory trait via HybridBTreeDirectory, splitting large files into 1 MiB chunks while keeping metadata in memory
Zero-Copy Reads: The hot-cache and chunk-cache return Arc<[u8]>-backed data, eliminating unnecessary copies when Tantivy reads postings
SQL Compatibility: Full-text search uses SQLite-style syntax (CREATE INDEX ... USING fts and MATCH); the USING fts method and the WITH clause options for tokenizers and weights are Turso extensions
Thread-Local Optimization: The FTS_TOKENIZER cache lets helper functions like fts_highlight and fts_match reuse one analyzer per thread instead of building a new one on every call
Consistency: Pending writes are visible immediately through the in-memory hashmap, ensuring new segments are searchable before B-Tree flush

Frequently Asked Questions

How does Turso's full-text search differ from SQLite's FTS5?

Turso uses Tantivy rather than SQLite's built-in FTS5 module. This keeps the search engine Rust-native, in the same language as the rest of Turso's core, while maintaining SQL compatibility through the MATCH operator and CREATE INDEX ... USING fts syntax. The HybridBTreeDirectory adapter allows Tantivy to run atop Turso's existing storage engine rather than requiring separate virtual tables.

What tokenizers are supported in Turso's FTS implementation?

Turso validates tokenizer names against the SUPPORTED_TOKENIZERS constant during index creation (lines 1352‑1410). The default tokenizer uses Tantivy's standard text analyzer, which lowercases tokens and splits text in a Unicode-aware way. Any other supported tokenizer can be selected via the WITH (tokenizer='...') clause when creating the index; names outside SUPPORTED_TOKENIZERS are rejected.

Can I combine full-text search with vector similarity search in Turso?

Yes. The core/vector/operations/* modules provide vector similarity functions (such as vector_distance_cos) that can be combined with FTS results. You can filter candidates using MATCH predicates and then rank by vector similarity, or use both scores in a hybrid ranking formula within the same SQL query.
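One possible hybrid ranking formula is a weighted blend of a normalized text score and cosine similarity. The sketch below is illustrative only: the blend, the alpha parameter, and the function names are assumptions rather than anything Turso prescribes, and in practice the same arithmetic would be written in SQL.

```rust
/// Cosine distance, defined as 1 - cos(a, b).
/// Returns None for empty, mismatched, or zero-length vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Hypothetical blend: `alpha` weights the text score (normalized by the best
/// text score in the candidate set) against cosine similarity.
fn hybrid_score(text_score: f32, max_text_score: f32, distance: f32, alpha: f32) -> f32 {
    let text = if max_text_score > 0.0 { text_score / max_text_score } else { 0.0 };
    let similarity = 1.0 - distance;
    alpha * text + (1.0 - alpha) * similarity
}
```

Normalizing the text score first matters because full-text relevance scores are unbounded, while cosine similarity always lies between -1 and 1.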

How does the hybrid directory manage memory usage?

The HybridBTreeDirectory implements a two-tier caching system that bounds memory usage to approximately 200 MiB by default. Small metadata files and term dictionaries remain in the hot-cache, while large segment files are streamed on-demand from 1 MiB chunks stored in the B-Tree. The chunk_cache uses an LRU eviction policy, and the clone_with_fresh_pending method ensures concurrent queries don't retain unbounded pending writes.
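A byte-budgeted LRU cache of chunks can be sketched as below. This is a deliberately simple model (recency tracking is linear-time), LruChunkCache and its fields are hypothetical names, and the budget is a constructor parameter rather than the roughly 200 MiB default described above.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// (file path, chunk number)
type ChunkKey = (String, u64);

/// Hypothetical chunk cache that evicts least-recently-used chunks once the
/// total cached size exceeds a byte budget.
struct LruChunkCache {
    budget_bytes: usize,
    used_bytes: usize,
    chunks: HashMap<ChunkKey, Arc<[u8]>>,
    /// Recency order: front is least recently used, back is most recent.
    order: VecDeque<ChunkKey>,
}

impl LruChunkCache {
    fn new(budget_bytes: usize) -> Self {
        LruChunkCache {
            budget_bytes,
            used_bytes: 0,
            chunks: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Return a cached chunk and mark it as most recently used.
    fn get(&mut self, key: &ChunkKey) -> Option<Arc<[u8]>> {
        let hit = self.chunks.get(key).cloned()?;
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
        Some(hit)
    }

    /// Insert a chunk, then evict from the cold end until the budget holds.
    fn insert(&mut self, key: ChunkKey, data: Arc<[u8]>) {
        if let Some(old) = self.chunks.remove(&key) {
            self.used_bytes -= old.len();
            self.order.retain(|k| k != &key);
        }
        self.used_bytes += data.len();
        self.chunks.insert(key.clone(), data);
        self.order.push_back(key);

        while self.used_bytes > self.budget_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    if let Some(bytes) = self.chunks.remove(&victim) {
                        self.used_bytes -= bytes.len();
                    }
                }
                None => break,
            }
        }
    }
}
```

Budgeting by bytes rather than by entry count keeps the bound meaningful even when the last chunk of a file is much smaller than the rest.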
