Implementing Full-Text Search with Tantivy in Turso: A Technical Deep Dive
Turso implements native full-text search by embedding the Tantivy search library into its storage engine, using a hybrid B-Tree directory architecture that caches hot files in memory while streaming large segments from disk.
Turso, the edge-compatible SQLite database, ships with a native full-text search (FTS) engine built on top of the Tantivy search library. The implementation has three tightly coupled architectural layers that bridge Tantivy's type-safe Rust search primitives with Turso's B-Tree storage backend. It lives primarily in core/index_method/fts.rs and provides SQLite-compatible CREATE INDEX ... USING fts semantics while retaining Tantivy's high-performance indexing capabilities.
Architecture Overview
Turso's FTS implementation consists of three core layers that work together to provide fast text indexing and querying.
Tokenizer and Helper Functions
The foundation provides fast, reusable tokenization utilities that function even without an FTS index. The fts_highlight and fts_match functions (lines 81‑90 and 101‑110 of core/index_method/fts.rs) offer stand-alone text processing capabilities.
A thread-local TextAnalyzer cached as FTS_TOKENIZER is reused for every call, avoiding the cost of creating a new analyzer each time (lines 66‑74). Reusing one analyzer per thread keeps tokenization overhead low during high-throughput operations.
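The caching pattern described above can be sketched with the standard library alone. This is a minimal illustration, not Turso's code: SimpleAnalyzer is a hypothetical stand-in for Tantivy's TextAnalyzer, and BUILD_COUNT exists only to show that the analyzer is constructed once per thread.

```rust
use std::cell::{Cell, RefCell};

thread_local! {
    /// Number of analyzers built on this thread (illustration only).
    static BUILD_COUNT: Cell<usize> = Cell::new(0);
    /// One analyzer per thread, built lazily on first use and then reused.
    static FTS_TOKENIZER: RefCell<SimpleAnalyzer> = RefCell::new(SimpleAnalyzer::new());
}

/// Hypothetical stand-in for a real text analyzer.
struct SimpleAnalyzer;

impl SimpleAnalyzer {
    fn new() -> Self {
        BUILD_COUNT.with(|c| c.set(c.get() + 1));
        SimpleAnalyzer
    }

    /// Lowercases the input and splits it on non-alphanumeric characters.
    fn tokenize(&mut self, text: &str) -> Vec<String> {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(|t| t.to_lowercase())
            .collect()
    }
}

/// Tokenizes `text` with the thread's cached analyzer.
pub fn tokenize(text: &str) -> Vec<String> {
    FTS_TOKENIZER.with(|analyzer| analyzer.borrow_mut().tokenize(text))
}

/// Reports how many analyzers this thread has constructed so far.
pub fn analyzer_builds() -> usize {
    BUILD_COUNT.with(|c| c.get())
}
```

Calling tokenize repeatedly on one thread constructs the analyzer only on the first call. The sketch still allocates the returned token vector, so the saving is in analyzer construction rather than in all allocation.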
Hybrid B-Tree Directory
The HybridBTreeDirectory struct adapts Tantivy's Directory abstraction to Turso's B-Tree storage (lines 74‑106 and 171‑190). This implementation employs a two-tier caching strategy:
- Hot-cache: Small metadata files and term dictionaries remain fully resident in memory
- Chunk-cache: Large segment files are split into 1 MiB chunks (DEFAULT_CHUNK_SIZE) stored as separate B-Tree rows
When Tantivy requests file access, get_file_handle checks the hot-cache first, then pending writes, and finally lazily loads required chunks via LazyFileHandle. The get_chunks_range_blocking routine (lines 656‑704) seeks once and reads contiguous chunk ranges, dramatically reducing I/O overhead for large index segments.
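To make the contiguous chunk read concrete, here is a small sketch of the arithmetic that maps a byte range onto chunk indices. The chunk_range function is a hypothetical helper written for illustration; only the 1 MiB chunk size and the DEFAULT_CHUNK_SIZE name come from the text.

```rust
use std::ops::RangeInclusive;

/// Chunk size named in the text (1 MiB).
pub const DEFAULT_CHUNK_SIZE: u64 = 1024 * 1024;

/// Returns the inclusive range of chunk indices that cover `len` bytes
/// starting at byte `offset`, or `None` for an empty read.
pub fn chunk_range(offset: u64, len: u64) -> Option<RangeInclusive<u64>> {
    if len == 0 {
        return None;
    }
    let first = offset / DEFAULT_CHUNK_SIZE;
    // Index of the chunk that holds the last requested byte.
    let last = (offset + len - 1) / DEFAULT_CHUNK_SIZE;
    Some(first..=last)
}
```

Because the resulting indices are contiguous, a reader can seek to the first chunk row and scan forward to the last one instead of issuing one lookup per chunk.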
FTS Index Method and Attachment
The FtsIndexMethod implements Turso's IndexMethod trait (lines 1290‑1296), connecting the directory abstraction to Turso's index-method plumbing. When processing CREATE INDEX ... USING fts, the method creates an FtsIndexAttachment (lines 1318‑1350) that holds:
- The Tantivy Schema and IndexReader
- Field-to-column mappings
- Query patterns and the shared cached directory
The attachment parses WITH clause options—including tokenizer and weights—during the attach phase (lines 1352‑1410), validating tokenizer names against SUPPORTED_TOKENIZERS and configuring per-field boost factors.
How Full-Text Search Works in Turso
Understanding the data flow reveals how Turso maintains consistency and performance across the search lifecycle.
Tokenizer Caching
Every text operation uses the thread-local FTS_TOKENIZER instance, so repeated queries do not pay to construct a new analyzer. Because the cache is thread-local, it persists across SQL statements executed on the same thread.
Directory I/O Flow
When Tantivy reads index files, HybridBTreeDirectory::get_file_handle implements a three-tier lookup:
- Check the in-memory hot-cache for small metadata files
- Check the pending-writes hashmap for uncommitted changes
- Load 1 MiB chunks from the B-Tree via get_chunks_range_blocking
This hybrid approach keeps memory usage bounded (under approximately 200 MiB by default) while ensuring critical term dictionaries remain resident.
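The lookup order above can be sketched as follows. SketchDirectory, Source, and get_file are hypothetical names for illustration, and plain hash maps stand in for the B-Tree. Unlike the lazy handle described in the text, this sketch concatenates chunks eagerly to stay short.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::Arc;

/// Which tier satisfied a lookup (illustration only).
#[derive(Debug, PartialEq)]
pub enum Source {
    HotCache,
    PendingWrite,
    BTreeChunks,
}

/// Hypothetical stand-in for the hybrid directory's three tiers.
#[derive(Default)]
pub struct SketchDirectory {
    pub hot_cache: HashMap<PathBuf, Arc<[u8]>>,
    pub pending_writes: HashMap<PathBuf, Arc<[u8]>>,
    /// Stand-in for chunk rows stored in the B-Tree, in chunk order.
    pub btree_chunks: HashMap<PathBuf, Vec<Arc<[u8]>>>,
}

impl SketchDirectory {
    /// Checks the hot cache, then pending writes, then persisted chunks.
    pub fn get_file(&self, path: &Path) -> Option<(Source, Arc<[u8]>)> {
        if let Some(bytes) = self.hot_cache.get(path) {
            return Some((Source::HotCache, Arc::clone(bytes)));
        }
        if let Some(bytes) = self.pending_writes.get(path) {
            return Some((Source::PendingWrite, Arc::clone(bytes)));
        }
        let chunks = self.btree_chunks.get(path)?;
        // Eager concatenation; a lazy handle would fetch chunks on demand.
        let joined: Vec<u8> = chunks.iter().flat_map(|c| c.iter().copied()).collect();
        Some((Source::BTreeChunks, Arc::from(joined)))
    }
}
```

Returning Arc<[u8]> from the first two tiers means a hit only bumps a reference count rather than copying the file contents.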
Write Path Consistency
When Tantivy closes a file during segment creation, HybridWriter::terminate_ref (lines 443‑461) updates the in-memory catalog and pushes data to the hot-cache. Pending writes are immediately visible to subsequent reads through the hashmap, ensuring newly created segments can be searched without waiting for an asynchronous B-Tree flush.
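A minimal sketch of this write path, assuming a shared map of pending writes: SketchWriter, PendingWrites, and read_pending are hypothetical names, and the real HybridWriter also updates a catalog and hot-cache that are omitted here.

```rust
use std::collections::HashMap;
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// Shared map of files written but not yet flushed to the B-Tree.
pub type PendingWrites = Arc<Mutex<HashMap<PathBuf, Arc<[u8]>>>>;

/// Hypothetical writer that buffers bytes and publishes them on terminate.
pub struct SketchWriter {
    path: PathBuf,
    buffer: Vec<u8>,
    pending: PendingWrites,
}

impl SketchWriter {
    pub fn new(path: impl Into<PathBuf>, pending: PendingWrites) -> Self {
        SketchWriter { path: path.into(), buffer: Vec::new(), pending }
    }

    /// Publishes the buffered bytes so later reads can see them immediately.
    pub fn terminate(self) -> io::Result<()> {
        let SketchWriter { path, buffer, pending } = self;
        let mut map = pending
            .lock()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "pending-writes lock poisoned"))?;
        map.insert(path, Arc::from(buffer));
        Ok(())
    }
}

impl Write for SketchWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Looks up a file among the pending writes.
pub fn read_pending(pending: &PendingWrites, path: &str) -> Option<Arc<[u8]>> {
    let guard = pending.lock().ok()?;
    guard.get(Path::new(path)).cloned()
}
```

In this sketch a file becomes readable at the moment terminate returns, which mirrors the visibility guarantee described above without modeling the later flush.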
Index Creation
When executing CREATE INDEX … USING fts, Turso:
- Parses the WITH clause for tokenizer and weight options
- Validates the tokenizer against SUPPORTED_TOKENIZERS
- Creates a Tantivy Schema with a fast-rowid field and text fields using the selected tokenizer
- Stores the field weights for query-time boosting
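The option-handling steps above can be sketched as below. This is an illustration under stated assumptions: ASSUMED_TOKENIZERS is a hypothetical list (the actual contents of SUPPORTED_TOKENIZERS are not shown in this article), and parse_weights and validate_tokenizer are made-up helper names.

```rust
/// Hypothetical tokenizer list; not the real contents of SUPPORTED_TOKENIZERS.
pub const ASSUMED_TOKENIZERS: &[&str] = &["default", "raw"];

/// Accepts a tokenizer name only if it appears in the assumed list.
pub fn validate_tokenizer(name: &str) -> Result<String, String> {
    if ASSUMED_TOKENIZERS.contains(&name) {
        Ok(name.to_string())
    } else {
        Err(format!("unsupported tokenizer: {name}"))
    }
}

/// Parses a weights string such as "title=2.0,body=1.0" into (field, boost) pairs.
pub fn parse_weights(spec: &str) -> Result<Vec<(String, f32)>, String> {
    spec.split(',')
        .map(str::trim)
        .filter(|pair| !pair.is_empty())
        .map(|pair| -> Result<(String, f32), String> {
            let (field, value) = pair
                .split_once('=')
                .ok_or_else(|| format!("expected field=weight, got `{pair}`"))?;
            let weight: f32 = value
                .trim()
                .parse()
                .map_err(|_| format!("invalid weight for `{}`", field.trim()))?;
            Ok((field.trim().to_string(), weight))
        })
        .collect()
}
```

Rejecting unknown tokenizers and malformed weights at index creation keeps bad configuration from surfacing later as a query-time error.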
Query Execution
For MATCH queries, the FtsIndexAttachment creates a fresh HybridBTreeDirectory via clone_with_fresh_pending to isolate concurrent writers. It builds a tantivy::query::QueryParser for the index's text fields, compiles the user's expression into a Tantivy Query, and executes via Searcher::search using a TopDocs collector. The resulting DocAddress values are mapped back to Turso rowids.
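The final collection step can be illustrated without Tantivy. The sketch below keeps the k best-scoring hits in a bounded min-heap, which is the general idea behind a top-k collector. It is not Tantivy's implementation, and Hit, ByScore, and top_k are hypothetical names, with rowid standing in for the value read from the rowid fast field.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// A scored hit already mapped back to a rowid.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Hit {
    pub score: f32,
    pub rowid: i64,
}

/// Orders hits by score; on equal scores the lower rowid ranks higher.
struct ByScore(Hit);

impl Ord for ByScore {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0
            .score
            .total_cmp(&other.0.score)
            .then_with(|| other.0.rowid.cmp(&self.0.rowid))
    }
}

impl PartialOrd for ByScore {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for ByScore {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for ByScore {}

/// Returns the `k` best hits, highest score first.
pub fn top_k(hits: impl IntoIterator<Item = Hit>, k: usize) -> Vec<Hit> {
    // `Reverse` turns the max-heap into a min-heap, so `pop` removes the worst hit.
    let mut heap: BinaryHeap<Reverse<ByScore>> = BinaryHeap::new();
    for hit in hits {
        heap.push(Reverse(ByScore(hit)));
        if heap.len() > k {
            heap.pop();
        }
    }
    let mut best: Vec<Hit> = heap.into_iter().map(|Reverse(ByScore(h))| h).collect();
    best.sort_by(|a, b| ByScore(*b).cmp(&ByScore(*a)));
    best
}
```

The heap never holds more than k + 1 entries, so memory stays proportional to the requested limit rather than to the number of matching documents.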
Code Examples
Highlighting Search Terms Without an Index
Use fts_highlight to wrap matching terms in HTML tags without creating an FTS index:
use turso_core::index_method::fts::fts_highlight;

let highlighted = fts_highlight(
    "The quick brown fox jumps over the lazy dog",
    "quick fox",
    "<b>", "</b>",
);
assert_eq!(
    highlighted,
    "The <b>quick</b> brown <b>fox</b> jumps over the lazy dog"
);
This function tokenizes both the query and text using the cached FTS_TOKENIZER (lines 81‑90).
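For readers who want to see the mechanics, here is a std-only sketch of token-based highlighting. It is not the fts_highlight implementation: the highlight function below is hypothetical and assumes tokens are lowercased runs of alphanumeric characters.

```rust
use std::collections::HashSet;

/// Wraps every word of `text` that matches a query term in `open`/`close`.
/// Matching is case-insensitive; the original casing of `text` is preserved.
pub fn highlight(text: &str, query: &str, open: &str, close: &str) -> String {
    let terms: HashSet<String> = query
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect();

    let mut out = String::with_capacity(text.len());
    let mut word_start: Option<usize> = None;

    // A trailing sentinel space flushes the final word through the same branch.
    for (i, c) in text.char_indices().chain(std::iter::once((text.len(), ' '))) {
        if c.is_alphanumeric() {
            if word_start.is_none() {
                word_start = Some(i);
            }
            continue;
        }
        if let Some(start) = word_start.take() {
            let word = &text[start..i];
            if terms.contains(&word.to_lowercase()) {
                out.push_str(open);
                out.push_str(word);
                out.push_str(close);
            } else {
                out.push_str(word);
            }
        }
        // Copy separators through unchanged, but skip the sentinel itself.
        if i < text.len() {
            out.push(c);
        }
    }
    out
}
```

Punctuation and whitespace pass through untouched, so the output differs from the input only by the inserted tags.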
Testing Text Matches
The fts_match function provides boolean matching for simple use cases:
use turso_core::index_method::fts::fts_match;
assert!(fts_match("hello world", "world"));
assert!(!fts_match("hello world", "planet"));
The implementation returns true if any query token appears in the target string (lines 55‑60).
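The any-token rule can be sketched in a few lines. The matches_any function is a hypothetical illustration, again assuming lowercased alphanumeric tokens, and is not the fts_match source.

```rust
use std::collections::HashSet;

/// Splits text into lowercased alphanumeric tokens.
fn tokens(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Returns true if at least one query token also occurs in `text`.
pub fn matches_any(text: &str, query: &str) -> bool {
    let text_tokens = tokens(text);
    tokens(query).iter().any(|t| text_tokens.contains(t))
}
```

In this sketch matching happens on whole tokens, so a query of "wor" does not match "hello world", and an empty query matches nothing.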
Creating an FTS Index
Define full-text indexes using standard SQL with Turso-specific extensions:
CREATE INDEX idx_article_body
    ON articles
    USING fts (title, body)
    WITH (tokenizer='default', weights='title=2.0,body=1.0');
This triggers FtsIndexMethod::attach, which builds a Tantivy Schema with boosted title fields (lines 1352‑1410).
Executing MATCH Queries
Query indexed content using the MATCH operator:
SELECT rowid, title
FROM articles
WHERE body MATCH 'rust programming';
At runtime, the MATCH predicate is parsed into a Tantivy query and executed via Searcher::search, and the resulting DocAddress values are converted to rowids.
Accessing the Hybrid Directory
Advanced users can interact with the underlying storage layer:
use turso_core::index_method::fts::HybridBTreeDirectory;
use tantivy::directory::Directory;
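
// Hold the read guard in a local so the borrowed directory outlives this statement.
let state = cursor.cached_directory_state.read();
let dir: &dyn Directory = &*state
    .as_ref()
    .unwrap()
    .directory;

for entry in dir.list_all()? {
    println!("File: {}", entry.path.display());
}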
The HybridBTreeDirectory implements the complete Directory trait including list_all, exists, and open_write (starting at line 690).
Key Source Files
- core/index_method/fts.rs: Core implementation including tokenizer helpers, HybridBTreeDirectory, FtsIndexMethod, and FtsIndexAttachment
- core/index_method/mod.rs: Registers FtsIndexMethod under the name "fts" for parser resolution
- parser/src/parser.rs: Handles the MATCH operator and TK_MATCH token parsing
- tests/integration/index_method/mod.rs: Integration tests for FTS creation, write-and-search cycles, and utility functions
- core/vector/operations/*: Vector similarity functions that combine with FTS for hybrid search
Summary
- Hybrid Architecture: Turso implements Tantivy's Directory trait via HybridBTreeDirectory, splitting large files into 1 MiB chunks while keeping metadata in memory
- Zero-Copy Reads: The hot-cache and chunk-cache return Arc<[u8]>-backed data, eliminating unnecessary copies when Tantivy reads postings
- SQL Compatibility: Full-text search uses SQLite-style syntax (CREATE INDEX ... USING fts and MATCH) with extended WITH clause options for tokenizers and weights
- Thread-Local Optimization: The FTS_TOKENIZER cache lets helper functions like fts_highlight and fts_match reuse one analyzer per thread instead of building a new one on each call
- Consistency: Pending writes are visible immediately through the in-memory hashmap, ensuring new segments are searchable before B-Tree flush
Frequently Asked Questions
How does Turso's full-text search differ from SQLite's FTS5?
Turso uses Tantivy rather than SQLite's built-in FTS5 module. This provides a Rust-native implementation with better memory safety and performance characteristics, while maintaining SQL compatibility through the MATCH operator and CREATE INDEX ... USING fts syntax. The HybridBTreeDirectory adapter allows Tantivy to run atop Turso's existing storage engine rather than requiring separate virtual tables.
What tokenizers are supported in Turso's FTS implementation?
Turso validates tokenizer names against the SUPPORTED_TOKENIZERS constant during index creation (lines 1352‑1410). The default tokenizer uses Tantivy's standard text analyzer with case folding and unicode-aware tokenization. Custom tokenizers can be specified via the WITH (tokenizer='...') clause when creating the index.
Can I combine full-text search with vector similarity search in Turso?
Yes. The core/vector/operations/* modules provide vector similarity functions (such as distance_cos) that can be combined with FTS results. You can filter candidates using MATCH predicates and then rank by vector similarity, or use both scores in a hybrid ranking formula within the same SQL query.
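As a rough illustration of such a hybrid formula, the sketch below computes a cosine distance and blends it with a text relevance score. Both functions are hypothetical: cosine_distance is not the distance_cos source, and the linear blend with an alpha weight is just one possible choice, not a formula Turso prescribes.

```rust
/// Cosine distance (1 - cosine similarity). Returns `None` for empty,
/// mismatched, or zero-length vectors.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Assumed blend: `alpha` weights the text score, the remainder weights
/// vector similarity (1 - distance). Higher results rank first.
pub fn hybrid_score(text_score: f32, distance: f32, alpha: f32) -> f32 {
    alpha * text_score + (1.0 - alpha) * (1.0 - distance)
}
```

Text scores and vector similarities live on different scales, so in practice the text score would usually be normalized before blending; that step is left out here.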
How does the hybrid directory manage memory usage?
The HybridBTreeDirectory implements a two-tier caching system that bounds memory usage to approximately 200 MiB by default. Small metadata files and term dictionaries remain in the hot-cache, while large segment files are streamed on-demand from 1 MiB chunks stored in the B-Tree. The chunk_cache uses an LRU eviction policy, and the clone_with_fresh_pending method ensures concurrent queries don't retain unbounded pending writes.
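A byte-bounded LRU cache of the kind described can be sketched as follows. ChunkCache and ChunkKey are hypothetical names, and the linear-time recency update is chosen for brevity rather than speed.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Cache key: (file path, chunk index).
pub type ChunkKey = (String, u64);

/// Hypothetical LRU cache bounded by total cached bytes.
pub struct ChunkCache {
    capacity_bytes: usize,
    used_bytes: usize,
    entries: HashMap<ChunkKey, Arc<[u8]>>,
    /// Recency order; the front is the least recently used key.
    order: VecDeque<ChunkKey>,
}

impl ChunkCache {
    pub fn new(capacity_bytes: usize) -> Self {
        ChunkCache { capacity_bytes, used_bytes: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    pub fn contains(&self, key: &ChunkKey) -> bool {
        self.entries.contains_key(key)
    }

    /// Returns the chunk and marks it as most recently used.
    pub fn get(&mut self, key: &ChunkKey) -> Option<Arc<[u8]>> {
        let data = self.entries.get(key).cloned()?;
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
        Some(data)
    }

    /// Inserts a chunk, then evicts least recently used chunks until the
    /// cache fits within its byte budget again.
    pub fn insert(&mut self, key: ChunkKey, data: Arc<[u8]>) {
        if let Some(old) = self.entries.remove(&key) {
            self.used_bytes -= old.len();
            self.order.retain(|k| k != &key);
        }
        self.used_bytes += data.len();
        self.entries.insert(key.clone(), data);
        self.order.push_back(key);
        while self.used_bytes > self.capacity_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    if let Some(bytes) = self.entries.remove(&victim) {
                        self.used_bytes -= bytes.len();
                    }
                }
                None => break,
            }
        }
    }
}
```

Bounding by bytes rather than by entry count fits a cache whose final chunk per file can be shorter than 1 MiB.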