How the DOM Dehydration Pipeline Converts Live DOM to LLM-Consumable Text
The DOM dehydration pipeline transforms a live browser DOM into a linear, indexed text representation by flattening the hierarchy, filtering redundant attributes, and assigning numeric highlight indices to interactive elements, enabling LLMs to reference and manipulate specific UI components using concise numeric identifiers.
The alibaba/page-agent repository implements a sophisticated browser automation system that bridges web pages and AI agents. At its core lies the DOM dehydration pipeline, which solves the fundamental problem of how large language models (LLMs) can perceive and reason about dynamic web content. This pipeline converts the complex, nested Document Object Model into a deterministic, text-based format that preserves essential interactive elements while removing visual noise.
Three-Stage DOM Dehydration Pipeline
The pipeline operates through three distinct phases, each implemented in packages/page-controller/src/dom/index.ts. This modular approach ensures that the live DOM state is captured, structured, and rendered into a format optimized for token-efficient LLM consumption.
Stage 1: DOM Extraction and Flattening
The process begins with getFlatTree(), which traverses the live document to build a FlatDomTree data structure. This function configures a low-level DOM extractor (domTree) with parameters for viewport expansion, element highlighting, and visibility filtering.
During this traversal, every interactive element receives a sequential highlightIndex (e.g., [0], [1], [2]) that serves as a stable reference for the LLM. The system maintains a WeakMap called newElementsCache to track whether an interactive node has been seen in previous snapshots. If a node is newly discovered, it receives an isNew flag, which later renders as a * prefix in the output to alert the LLM to fresh interactive elements.
```typescript
// packages/page-controller/src/dom/index.ts
export function getFlatTree(config: DomConfig): FlatDomTree {
  const elements = domTree({
    doHighlightElements: true,
    viewportExpansion: VIEWPORT_EXPANSION,
  }) as FlatDomTree;

  // Mark newly-seen interactive nodes
  for (const nodeId in elements.map) {
    const node = elements.map[nodeId];
    if (node.isInteractive && node.ref && !newElementsCache.has(node.ref as HTMLElement)) {
      newElementsCache.set(node.ref as HTMLElement, window.location.href);
      node.isNew = true;
    }
  }

  return elements;
}
```
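The marking step can also be looked at in isolation. The following is a minimal sketch, not the repository's code: the `FlatNode` and `FlatTree` shapes, the `seenElements` cache, and the `markNewNodes` helper are hypothetical simplifications, and plain objects stand in for DOM elements so the snippet runs outside a browser.

```typescript
// Hypothetical, simplified shapes; the real FlatDomTree in page-agent may differ.
interface FlatNode {
  tagName: string;
  isInteractive: boolean;
  highlightIndex?: number;
  ref?: object; // stands in for the HTMLElement reference
  isNew?: boolean;
}

interface FlatTree {
  rootId: string;
  map: Record<string, FlatNode>;
}

// WeakMap keys are held weakly, so an entry disappears once its element is garbage-collected.
const seenElements = new WeakMap<object, string>();

function markNewNodes(tree: FlatTree, pageUrl: string): FlatTree {
  for (const id of Object.keys(tree.map)) {
    const node = tree.map[id];
    if (node.isInteractive && node.ref && !seenElements.has(node.ref)) {
      seenElements.set(node.ref, pageUrl); // remember where the element was first seen
      node.isNew = true;
    }
  }
  return tree;
}

// Two snapshots of the same page: the button persists, the input appears later.
const button = {};
const input = {};

const first = markNewNodes(
  {
    rootId: "0",
    map: {
      "0": { tagName: "button", isInteractive: true, highlightIndex: 0, ref: button },
    },
  },
  "https://example.com/",
);

const second = markNewNodes(
  {
    rootId: "0",
    map: {
      "0": { tagName: "button", isInteractive: true, highlightIndex: 0, ref: button },
      "1": { tagName: "input", isInteractive: true, highlightIndex: 1, ref: input },
    },
  },
  "https://example.com/",
);
```

In this sketch the button is flagged in the first snapshot only, while the input is flagged when it first appears in the second. Keying the cache on the element reference rather than on the index matters because indices are reassigned on every extraction.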
Stage 2: Hierarchical Tree Construction
Once the flat map is generated, flatTreeToString() initiates the conversion to a hierarchical structure. The function buildTreeNode() recursively constructs TreeNode instances starting from flatTree.rootId, preserving parent-child relationships that were flattened during extraction.
The setParentReferences() function then attaches upward pointers to each node, enabling depth-first traversal while maintaining proper indentation levels in the final output. This step also tags scrollable containers and preserves metadata necessary for reconstructing the page's logical structure.
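As an illustration of the two passes described above, here is a minimal sketch under assumed types. `FlatEntry`, `TreeNode`, `buildTree`, `setParents`, and `depthOf` are hypothetical names chosen for this example and are not the signatures used in the repository.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface FlatEntry {
  tagName: string;
  children: string[]; // ids of child entries in the flat map
}

interface TreeNode {
  tagName: string;
  children: TreeNode[];
  parent: TreeNode | null;
}

// First pass: rebuild nesting by following child ids from the root.
function buildTree(map: Record<string, FlatEntry>, id: string): TreeNode | null {
  const entry = map[id];
  if (!entry) return null; // dangling id: skip it
  const node: TreeNode = { tagName: entry.tagName, children: [], parent: null };
  for (const childId of entry.children) {
    const child = buildTree(map, childId);
    if (child) node.children.push(child);
  }
  return node;
}

// Second pass: attach upward pointers.
function setParents(node: TreeNode, parent: TreeNode | null = null): void {
  node.parent = parent;
  for (const child of node.children) setParents(child, node);
}

// With parent pointers in place, depth can be recovered by walking upward.
function depthOf(node: TreeNode): number {
  let depth = 0;
  for (let p = node.parent; p; p = p.parent) depth++;
  return depth;
}

const flat: Record<string, FlatEntry> = {
  root: { tagName: "body", children: ["nav", "main"] },
  nav: { tagName: "nav", children: ["link"] },
  link: { tagName: "a", children: [] },
  main: { tagName: "main", children: [] },
};

const tree = buildTree(flat, "root")!;
setParents(tree);
const linkNode = tree.children[0].children[0];
```

Here `linkNode` is the `a` element nested under `nav`, and walking its parent pointers gives a depth of 2.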
Stage 3: Textual Rendering and Attribute Filtering
The final transformation occurs in processNode(), which walks the hierarchical tree depth-first to generate the LLM-readable string. For interactive elements (those with a highlightIndex), the pipeline collects visible text up to the next clickable element using getAllTextTillNextClickableElement(), filters attributes against a whitelist (DEFAULT_INCLUDE_ATTRIBUTES), removes duplicates (such as role attributes matching tag names), and truncates long values via capTextLength().
The output format follows a strict convention: [index]<tag attributes>text</>. New elements display as *[index], while scrollable containers append data-scrollable annotations with coordinate data. Plain text nodes emit as indented content unless an ancestor already carries a highlight index, preventing duplicate content.
```typescript
// packages/page-controller/src/dom/index.ts
export function flatTreeToString(
  flatTree: FlatDomTree,
  includeAttributes?: string[]
): string {
  const rootNode = buildTreeNode(flatTree.rootId);
  setParentReferences(rootNode);

  const result: string[] = [];
  processNode(rootNode, 0, result);
  return result.join('\n');
}
```
Example output:
```text
[0]<a aria-label=Home>Home</>
[1]<button>Login</>
Welcome to our site!
*[2]<input placeholder=Search />
```
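To make the rendering rules concrete, the sketch below walks a small tree depth-first and emits lines in the `[index]<tag attributes>text</>` convention. It is an approximation under stated assumptions: `RenderNode`, `collectText`, and `render` are hypothetical names, attributes are assumed to be filtered already, and for simplicity every indexed element is closed with `</>`, so the input line differs slightly from the example output above.

```typescript
// Hypothetical node shape; attributes are assumed to be filtered already.
interface RenderNode {
  tagName?: string; // absent for plain text nodes
  text?: string; // content of a text node
  highlightIndex?: number; // present only on interactive elements
  isNew?: boolean;
  attributes?: Record<string, string>;
  children: RenderNode[];
}

// Gather the text beneath a node, skipping nested interactive elements,
// which are rendered on their own lines.
function collectText(node: RenderNode): string {
  const parts: string[] = [];
  for (const child of node.children) {
    if (child.highlightIndex !== undefined) continue;
    if (child.text) parts.push(child.text.trim());
    else parts.push(collectText(child));
  }
  return parts.filter(Boolean).join(" ");
}

function render(node: RenderNode, depth: number, insideIndexed: boolean, out: string[]): void {
  const indent = "\t".repeat(depth);

  if (node.highlightIndex !== undefined) {
    const attributes = node.attributes ?? {};
    const attrs = Object.keys(attributes)
      .map((key) => ` ${key}=${attributes[key]}`)
      .join("");
    const marker = node.isNew ? "*" : "";
    const open = `${marker}[${node.highlightIndex}]<${node.tagName}${attrs}>`;
    out.push(`${indent}${open}${collectText(node)}</>`);
    for (const child of node.children) render(child, depth + 1, true, out);
    return;
  }

  if (node.text !== undefined) {
    // Text under an indexed ancestor was already folded into that element's line.
    if (!insideIndexed && node.text.trim()) out.push(indent + node.text.trim());
    return;
  }

  // Non-interactive wrappers are transparent: render children at the same depth.
  for (const child of node.children) render(child, depth, insideIndexed, out);
}

const page: RenderNode = {
  tagName: "body",
  children: [
    {
      tagName: "a",
      highlightIndex: 0,
      attributes: { "aria-label": "Home" },
      children: [{ text: "Home", children: [] }],
    },
    {
      tagName: "button",
      highlightIndex: 1,
      children: [{ tagName: "span", children: [{ text: "Login", children: [] }] }],
    },
    { tagName: "p", children: [{ text: "Welcome to our site!", children: [] }] },
    {
      tagName: "input",
      highlightIndex: 2,
      isNew: true,
      attributes: { placeholder: "Search" },
      children: [],
    },
  ],
};

const lines: string[] = [];
render(page, 0, false, lines);
```

The `insideIndexed` flag is what prevents duplicate content: the text "Login" appears once, on the button's line, and is not emitted again when the walk reaches the text node itself.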
PageController Integration and Action Mapping
The PageController class orchestrates the pipeline through its updateTree() method in packages/page-controller/src/PageController.ts. After generating the textual representation, it stores the result in this.simplifiedHTML and constructs two critical lookup tables:
- selectorMap: Maps highlight indices to actual DOM nodes via dom.getSelectorMap(), enabling the system to resolve LLM actions ("click element 5") to concrete elements.
- elementTextMap: Maps indices to raw text lines via dom.getElementTextMap(), providing context for text input operations.
This architecture isolates the LLM from raw DOM manipulation. Given the same page state and the same history of previously seen elements (which determines the * markers), the pipeline produces the same text output, which is what makes the agent's reasoning reproducible.
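The index-to-element lookup can be sketched as follows. This is an illustrative example, not the PageController API: `ElementLike`, `AgentAction`, and `resolveAndRun` are hypothetical, and a plain object with a `click` method stands in for a DOM node.

```typescript
// Hypothetical shapes: in the browser the map values would be DOM nodes.
interface ElementLike {
  tagName: string;
  click(): void;
}

type AgentAction = { type: "click"; index: number };

function resolveAndRun(action: AgentAction, selectorMap: Map<number, ElementLike>): string {
  const target = selectorMap.get(action.index);
  if (!target) {
    // Stale or invented index: report back to the agent instead of throwing.
    return `No interactive element with index ${action.index}`;
  }
  target.click();
  return `Clicked <${target.tagName}> at index ${action.index}`;
}

let clicks = 0;
const selectorMap = new Map<number, ElementLike>([
  [5, { tagName: "button", click: () => { clicks++; } }],
]);

const ok = resolveAndRun({ type: "click", index: 5 }, selectorMap);
const missing = resolveAndRun({ type: "click", index: 9 }, selectorMap);
```

Returning a message for an unknown index, rather than throwing, is a design choice of this sketch: it gives the agent something to react to when it references an element that is no longer on the page.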
Summary
- The DOM dehydration pipeline in alibaba/page-agent converts live DOM into LLM-friendly text through three stages: flattening (getFlatTree), hierarchical reconstruction (buildTreeNode), and textual rendering (processNode).
- Interactive elements receive highlight indices that serve as stable numeric references for LLM actions, with new elements marked by * prefixes.
- Attribute filtering and text truncation minimize token usage while preserving semantic meaning, using whitelists and deduplication logic.
- The pipeline outputs deterministic, reproducible representations stored in simplifiedHTML, supported by selectorMap and elementTextMap for bidirectional LLM-to-DOM translation.
Frequently Asked Questions
What is the purpose of highlight indices in the DOM dehydration pipeline?
Highlight indices are sequential numeric identifiers (e.g., [0], [1]) assigned to every interactive element during the getFlatTree() extraction phase. These indices allow the LLM to reference specific UI components unambiguously when issuing commands like "click element 3" or "type into element 7", bridging the gap between textual understanding and concrete DOM manipulation.
How does the pipeline identify and mark newly appeared interactive elements?
The system maintains a WeakMap called newElementsCache that stores references to previously seen interactive nodes. During DOM extraction, if an interactive element's DOM reference is not found in this cache, the pipeline sets isNew = true on the node. When processNode() renders the tree, these nodes receive a * prefix (e.g., *[2]) to alert the LLM that these are fresh interactive opportunities.
What attribute filtering rules does the DOM dehydration pipeline apply?
The pipeline filters element attributes against a built-in whitelist (DEFAULT_INCLUDE_ATTRIBUTES) combined with any caller-supplied attributes. It removes redundant entries where the role attribute equals the tag name, deduplicates repeating values, and truncates long strings using capTextLength(). This ensures that only semantically relevant, concise attribute data reaches the LLM, optimizing token usage.
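The filtering rules above can be approximated in a few lines. This is a sketch under assumptions: the `INCLUDE` list, `MAX_VALUE_LENGTH`, `capText`, and `filterAttributes` are hypothetical stand-ins, and the actual contents of DEFAULT_INCLUDE_ATTRIBUTES and the behavior of capTextLength() in the repository may differ.

```typescript
// Hypothetical whitelist and cap; the repo's DEFAULT_INCLUDE_ATTRIBUTES may differ.
const INCLUDE = ["title", "type", "name", "role", "aria-label", "placeholder", "value"];
const MAX_VALUE_LENGTH = 20;

function capText(value: string, max: number): string {
  return value.length > max ? value.slice(0, max) + "..." : value;
}

function filterAttributes(
  tagName: string,
  attributes: Record<string, string>,
  extra: string[] = [],
): Record<string, string> {
  const allowed = new Set([...INCLUDE, ...extra]);
  const seenValues = new Set<string>();
  const kept: Record<string, string> = {};

  for (const key of Object.keys(attributes)) {
    const value = attributes[key].trim();
    if (!allowed.has(key) || value === "") continue; // not whitelisted or empty
    if (key === "role" && value === tagName) continue; // role merely repeats the tag
    if (seenValues.has(value)) continue; // same value already emitted
    seenValues.add(value);
    kept[key] = capText(value, MAX_VALUE_LENGTH);
  }
  return kept;
}

const filtered = filterAttributes(
  "button",
  {
    role: "button",
    class: "btn btn-primary",
    "aria-label": "Submit order",
    title: "Submit order",
    "data-testid": "submit",
    placeholder: "Enter the shipping address here",
  },
  ["data-testid"],
);
```

In this example `role` is dropped because it equals the tag name, `class` is not whitelisted, `title` duplicates the `aria-label` value, `data-testid` survives only because the caller supplied it, and the long placeholder is truncated.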
How does the LLM use the dehydration pipeline output to interact with the browser?
The LLM receives the simplifiedHTML string containing indexed element markers. When the LLM responds with actions referencing these indices (e.g., "click [5]"), the PageController uses selectorMap to resolve index 5 to the actual DOM node and elementTextMap to verify context. This bidirectional mapping allows the LLM to perceive the page through text and act upon it through the original DOM references.