How Turso's B-Tree Storage Engine Stores and Retrieves Data
Turso stores all tables and indexes in fixed-size pages within a single SQLite-compatible file, using interior B-tree nodes for navigation and leaf nodes for row payload, with overflow pages chained for large records.
Turso is an edge-native database built on libSQL, the open-source fork of SQLite maintained by Turso (tursodatabase/turso). The Turso B-tree storage engine persists data in a single file split into fixed-size pages, implementing a traditional B-tree structure where interior nodes point to child pages and leaf nodes contain actual row data. This design maintains byte-for-byte compatibility with SQLite's on-disk format while integrating Turso-specific async I/O and MVCC support.
Page Layout and On-Disk Structure
All B-tree pages in Turso begin with a standardized header that defines the page type and cell distribution. In core/storage/btree.rs, the offset module defines the byte positions for each header field:
pub mod offset {
pub const BTREE_PAGE_TYPE: usize = 0; // u8
pub const BTREE_FIRST_FREEBLOCK: usize = 1; // u16
pub const BTREE_CELL_COUNT: usize = 3; // u16
pub const BTREE_CELL_CONTENT_AREA: usize = 5; // u16
pub const BTREE_FRAGMENTED_BYTES_COUNT: usize = 7; // u8
pub const BTREE_RIGHTMOST_PTR: usize = 8; // u32
}
The header layout follows this structure: a 1-byte page type identifier, a 2-byte offset to the first freeblock, a 2-byte cell count, a 2-byte pointer to the cell content area start, a 1-byte fragmented bytes count, and a 4-byte rightmost pointer for interior nodes.
Each page is either an interior node—which stores child-page pointers and optional index keys—or a leaf node containing the actual row payload. The usable space per page is calculated as page_size – reserved_space, as defined in DatabaseHeader::usable_space within sqlite3_ondisk.rs (lines 69-71).
Cell Structure and Payload Management
Data within pages is organized into cells, represented by the BTreeCell enum in core/storage/btree.rs (lines 724-796):
pub enum BTreeCell {
TableInteriorCell(TableInteriorCell),
TableLeafCell(TableLeafCell),
IndexInteriorCell(IndexInteriorCell),
IndexLeafCell(IndexLeafCell),
}
Table cells contain a 64-bit row ID (i64), while index cells store the indexed key bytes. For leaf cells, payloads may be local (fitting entirely within the page) or overflow (spanning multiple pages).
The function payload_overflows in sqlite3_ondisk.rs calculates the local storage threshold based on page type and usable page size. When a payload exceeds this limit, the engine stores the initial portion locally and writes the remainder to a chain of overflow pages. The last four bytes of the cell header store the page number of the first overflow page, creating a linked list that process_overflow_read (lines 1011-1148) traverses to reconstruct the complete record.
Reading Data: The BTreeCursor Implementation
All data retrieval flows through BTreeCursor, which maintains a stack of pages representing the path from root to current leaf. The cursor implements a seek operation (lines 5565-5605) that performs the following steps:
- Binary search on interior pages using
InteriorPageBinarySearchState(lines 5530-5538) to locate the correct child pointer. - Descent via
cell_interior_read_left_child_pageto navigate to the target child page. - Leaf traversal to extract the record payload from
TableLeafCellorIndexLeafCell. - Overflow resolution through
process_overflow_readwhenfirst_overflow_pageis present, following the chain until the full payload is assembled.
The cursor's stack: PageStack structure stores, for each level, the page reference, current cell index, and total cell count. This allows the engine to resume traversal after I/O spills without re-reading cached pages. The read_btree_cell function (lines 814-887) parses raw page buffers according to page type, decoding varint-encoded payload sizes and determining overflow status.
Writing Data and Page Balancing
Insertion and updates utilize a WriteState state machine (lines 217-222) that handles payload allocation and overflow page creation. When writing:
- The engine allocates payload space or rewrites existing cells.
- If the payload exceeds local limits, it creates overflow pages and links them via the four-byte pointer field.
- When pages become over-full, the engine triggers balancing operations (
BalanceState, lines 236-305) that redistribute cells across up to three sibling pages to maintain B-tree invariants.
The write path ensures that the retrieval logic remains unchanged—overflow chains are transparent to readers, with BTreeCursor automatically following links during seek operations.
Pager Architecture and Data Flow
The storage engine relies on the Pager abstraction in core/storage/pager.rs for page allocation and the page_cache.rs module for in-memory caching. When storing data:
Pager::allocate_pagecreates aPageRefcontaining a page ID and buffer, inserting it into the page cache.BTreeCursor::insertconstructsBTreeCellinstances and writes payloads, including overflow page generation when necessary.- Page headers (type, cell count, rightmost pointer) are updated via
PageContentmethods using theoffsetconstants.
During retrieval, BTreeCursor::seek performs binary search through interior nodes, descends to leaf pages, and returns ImmutableRecord instances (lines 2628-2630) to the query engine. The same traversal logic applies to index lookups, range scans, and deletions—always starting at the root, searching interior nodes, and reading leaf payloads with automatic overflow handling.
// Open a Turso connection
let db = turso::Database::open("my.db")?;
// Insert a row (payload may overflow)
db.execute("INSERT INTO users(name, data) VALUES (?, ?)",
[&"alice", &large_blob])?;
// Retrieve a row – cursor walks the B-tree and stitches overflow pages
let row: (i64, Vec<u8>) = db.query_row(
"SELECT rowid, data FROM users WHERE name = ?", [&"alice"],
)?;
Summary
- Turso stores data in fixed-size pages within a single SQLite-compatible file, using interior nodes for navigation and leaf nodes for payload storage.
- The page header format is defined in
core/storage/btree.rswith specific byte offsets for type, cell count, and fragmentation metadata. - Cells are represented by the
BTreeCellenum, with leaf cells supporting local storage or overflow chains linked via four-byte page pointers. - BTreeCursor handles all reads via binary search on interior pages and stack-based traversal, automatically following overflow chains through
process_overflow_read. - Writes use the
WriteStatemachine and triggerBalanceStateoperations when redistributing cells across sibling pages. - The
Pagerandpage_cache.rscomponents manage page allocation and caching, ensuring efficient I/O for B-tree operations.
Frequently Asked Questions
How does Turso handle large payloads that exceed page size?
When a payload exceeds the local size limit calculated by payload_overflows, Turso stores the initial portion in the leaf cell and writes the remainder to a chain of overflow pages. The cell header stores the first overflow page number in its last four bytes. During retrieval, process_overflow_read follows this linked list page-by-page, concatenating fragments until the complete record is reconstructed.
What is the difference between interior and leaf pages in Turso's B-tree?
Interior pages contain navigation pointers (4-byte child page numbers) and optional index keys, guiding the BTreeCursor toward target data via binary search. Leaf pages store the actual row payloads or index entries. Interior pages use the rightmost pointer field in the header for the final child pointer, while leaf pages contain the BTreeCell variants holding row IDs and user data.
How does the BTreeCursor maintain its position during traversal?
The cursor maintains a PageStack that records the path from root to current leaf, storing each level's page reference, current cell index, and total cell count. This stack-based approach allows the engine to resume traversal after asynchronous I/O operations without re-reading pages already present in the page_cache.rs cache.
Is Turso's B-tree implementation compatible with standard SQLite files?
Yes, Turso maintains byte-for-byte compatibility with SQLite's on-disk format. The page header definitions in core/storage/btree.rs and format logic in sqlite3_ondisk.rs follow SQLite specifications, allowing Turso to read standard SQLite files while adding asynchronous I/O and MVCC capabilities through the Pager abstraction.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →