# How the DOM Dehydration Pipeline Converts Live DOM to LLM-Consumable Text

> Discover how the DOM dehydration pipeline converts live DOM to text for LLM consumption. It flattens, filters, and indexes UI elements for precise LLM manipulation and reference.

- Repository: [Alibaba/page-agent](https://github.com/alibaba/page-agent)
- Tags: internals
- Published: 2026-03-09

---

**The DOM dehydration pipeline transforms a live browser DOM into a linear, indexed text representation by flattening the hierarchy, filtering redundant attributes, and assigning numeric highlight indices to interactive elements, enabling LLMs to reference and manipulate specific UI components using concise numeric identifiers.**

The **alibaba/page-agent** repository implements a browser automation system that connects web pages to AI agents. At its core is the **DOM dehydration pipeline**, which addresses a basic problem: a large language model (LLM) cannot reason directly over a live, deeply nested Document Object Model. The pipeline converts that DOM into a compact text format that keeps the interactive elements and drops markup the model does not need.

## Three-Stage DOM Dehydration Pipeline

The pipeline operates through three distinct phases, each implemented in [`packages/page-controller/src/dom/index.ts`](https://github.com/alibaba/page-agent/blob/main/packages/page-controller/src/dom/index.ts). This modular approach ensures that the live DOM state is captured, structured, and rendered into a format optimized for token-efficient LLM consumption.

### Stage 1: DOM Extraction and Flattening

The process begins with `getFlatTree()`, which traverses the live document to build a **FlatDomTree** data structure. This function configures a low-level DOM extractor (`domTree`) with parameters for viewport expansion, element highlighting, and visibility filtering.

During this traversal, every interactive element receives a sequential `highlightIndex` (e.g., `[0]`, `[1]`, `[2]`) that the LLM uses to refer to that element within the current snapshot. The system also maintains a `WeakMap` called `newElementsCache` that records which interactive nodes appeared in previous snapshots. A node that is not yet in the cache receives an `isNew` flag, which later renders as a `*` prefix in the output to alert the LLM to fresh interactive elements.

```typescript
// packages/page-controller/src/dom/index.ts
export function getFlatTree(config: DomConfig): FlatDomTree {
  const elements = domTree({
    doHighlightElements: true,
    viewportExpansion: VIEWPORT_EXPANSION,
  }) as FlatDomTree;

  // Mark newly-seen interactive nodes
  for (const nodeId in elements.map) {
    const node = elements.map[nodeId];
    if (node.isInteractive && node.ref && !newElementsCache.has(node.ref as HTMLElement)) {
      newElementsCache.set(node.ref as HTMLElement, window.location.href);
      node.isNew = true;
    }
  }
  return elements;
}
```
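To see how the `isNew` flag behaves across snapshots, the sketch below replays the same marking loop outside the browser. The `FlatNode` and `FlatTree` shapes, the `markNewNodes` name, and the use of plain objects in place of `HTMLElement` references are simplifying assumptions for illustration, not the library's actual types.

```typescript
// Hypothetical, simplified shapes; the real FlatDomTree may carry more fields.
interface FlatNode {
  tagName: string;
  isInteractive: boolean;
  highlightIndex?: number;
  isNew?: boolean;
  ref?: object; // stands in for the live HTMLElement
}

interface FlatTree {
  rootId: string;
  map: Record<string, FlatNode>;
}

// WeakMap keys are held weakly, so elements removed from the page
// can still be garbage-collected even though the cache has seen them.
const seenElements = new WeakMap<object, string>();

function markNewNodes(tree: FlatTree, url: string): FlatTree {
  for (const node of Object.values(tree.map)) {
    if (node.isInteractive && node.ref && !seenElements.has(node.ref)) {
      seenElements.set(node.ref, url);
      node.isNew = true;
    }
  }
  return tree;
}

// Two snapshots: the button exists in both, the input appears in the second.
const button: object = {};
const input: object = {};

const first = markNewNodes(
  {
    rootId: "0",
    map: { "0": { tagName: "button", isInteractive: true, highlightIndex: 0, ref: button } },
  },
  "https://example.com",
);

const second = markNewNodes(
  {
    rootId: "0",
    map: {
      "0": { tagName: "button", isInteractive: true, highlightIndex: 0, ref: button },
      "1": { tagName: "input", isInteractive: true, highlightIndex: 1, ref: input },
    },
  },
  "https://example.com",
);

console.log(first.map["0"]?.isNew, second.map["0"]?.isNew, second.map["1"]?.isNew);
```

In this sketch, as in the excerpt above, the very first snapshot flags every interactive node, because the cache starts empty. In the second snapshot only the input is flagged.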

### Stage 2: Hierarchical Tree Construction

Once the flat map is generated, `flatTreeToString()` initiates the conversion to a hierarchical structure. The function `buildTreeNode()` recursively constructs `TreeNode` instances starting from `flatTree.rootId`, preserving parent-child relationships that were flattened during extraction.

The `setParentReferences()` function then attaches a parent pointer to each node, so that later stages can look upward through a node's ancestors, for example to check whether one of them already carries a highlight index. This step also tags scrollable containers and preserves metadata necessary for reconstructing the page's logical structure.
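The reconstruction step can be illustrated with a small, self-contained sketch. The `FlatEntry` and `TreeNode` shapes and the function names `buildTree`, `linkParents`, and `hasIndexedAncestor` are hypothetical stand-ins. The library's own `buildTreeNode()` and `setParentReferences()` may differ in signature and detail.

```typescript
// Hypothetical shapes for illustration; field names are assumptions.
interface FlatEntry {
  tagName?: string;
  text?: string;
  children?: string[];
  highlightIndex?: number;
}

interface TreeNode {
  tagName: string | undefined;
  text: string | undefined;
  highlightIndex: number | undefined;
  parent: TreeNode | null;
  children: TreeNode[];
}

// Recursively turn ids in the flat map back into nested nodes.
function buildTree(map: Record<string, FlatEntry>, id: string): TreeNode | null {
  const entry = map[id];
  if (!entry) return null; // dangling child id: skip rather than throw

  const node: TreeNode = {
    tagName: entry.tagName,
    text: entry.text,
    highlightIndex: entry.highlightIndex,
    parent: null,
    children: [],
  };
  for (const childId of entry.children ?? []) {
    const child = buildTree(map, childId);
    if (child) node.children.push(child);
  }
  return node;
}

// Second pass: attach upward pointers once the whole tree exists.
function linkParents(node: TreeNode, parent: TreeNode | null = null): void {
  node.parent = parent;
  for (const child of node.children) linkParents(child, node);
}

// True when some ancestor is an indexed (interactive) element.
function hasIndexedAncestor(node: TreeNode): boolean {
  for (let p = node.parent; p; p = p.parent) {
    if (p.highlightIndex !== undefined) return true;
  }
  return false;
}

const flat: Record<string, FlatEntry> = {
  root: { tagName: "body", children: ["btn", "note"] },
  btn: { tagName: "button", highlightIndex: 0, children: ["label"] },
  label: { text: "Login" },
  note: { text: "Welcome" },
};

const tree = buildTree(flat, "root");
if (tree) linkParents(tree);
```

With parent pointers in place, a renderer can tell that the `Login` text belongs to an indexed button, while `Welcome` is free-standing text.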

### Stage 3: Textual Rendering and Attribute Filtering

The final transformation occurs in `processNode()`, which walks the hierarchical tree depth-first to generate the LLM-readable string. For **interactive elements** (those with a `highlightIndex`), the pipeline collects visible text up to the next clickable element using `getAllTextTillNextClickableElement()`, filters attributes against a whitelist (`DEFAULT_INCLUDE_ATTRIBUTES`), removes duplicates (such as `role` attributes matching tag names), and truncates long values via `capTextLength()`.

The output format follows a strict convention: `[index]<tag attributes>text</>`. New elements display as `*[index]`, while scrollable containers append `data-scrollable` annotations with coordinate data. **Plain text nodes** emit as indented content unless an ancestor already carries a highlight index, preventing duplicate content.

```typescript
// packages/page-controller/src/dom/index.ts
export function flatTreeToString(
  flatTree: FlatDomTree,
  includeAttributes?: string[]
): string {
  const rootNode = buildTreeNode(flatTree.rootId);
  setParentReferences(rootNode);
  const result: string[] = [];
  processNode(rootNode, 0, result);
  return result.join('\n');
}
```

Example output:

```
[0]<a aria-label=Home>Home</>
[1]<button>Login</>
    Welcome to our site!
*[2]<input placeholder=Search />
```
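The rendering rules for a single interactive element can be sketched as follows. Everything here is illustrative: the `INCLUDE_ATTRIBUTES` list, the `capText` and `renderElement` helpers, the four-space indent, and the 20-character limit are assumptions rather than the library's actual whitelist, helpers, or limits. The self-closing ` />` form for elements without text is inferred from the example output above.

```typescript
// Hypothetical whitelist; the library's DEFAULT_INCLUDE_ATTRIBUTES may differ.
const INCLUDE_ATTRIBUTES = ["aria-label", "placeholder", "role", "title", "type"];

// Truncate long values so a single attribute cannot dominate the output.
function capText(text: string, max: number): string {
  return text.length > max ? text.slice(0, max) + "..." : text;
}

interface RenderableElement {
  index: number;
  tagName: string;
  attributes: Record<string, string>;
  text: string;
  isNew?: boolean;
}

function renderElement(el: RenderableElement, depth: number, maxAttrLength = 20): string {
  const attrs: string[] = [];
  const seenValues = new Set<string>();

  for (const name of INCLUDE_ATTRIBUTES) {
    const value = el.attributes[name]?.trim();
    if (!value) continue; // not present or empty
    // A role that repeats the tag name adds no information.
    if (name === "role" && value.toLowerCase() === el.tagName.toLowerCase()) continue;
    // Keep only the first attribute carrying a given value.
    if (seenValues.has(value)) continue;
    seenValues.add(value);
    attrs.push(`${name}=${capText(value, maxAttrLength)}`);
  }

  const indent = "    ".repeat(depth);
  const marker = el.isNew ? "*" : "";
  const attrText = attrs.length > 0 ? " " + attrs.join(" ") : "";
  const body = el.text ? `>${el.text}</>` : " />";
  return `${indent}${marker}[${el.index}]<${el.tagName}${attrText}${body}`;
}

const lines = [
  renderElement({ index: 0, tagName: "a", attributes: { "aria-label": "Home" }, text: "Home" }, 0),
  renderElement({ index: 1, tagName: "button", attributes: { role: "button", class: "btn" }, text: "Login" }, 0),
  renderElement({ index: 2, tagName: "input", attributes: { placeholder: "Search" }, text: "", isNew: true }, 0),
];
console.log(lines.join("\n"));
```

Under these assumptions the three calls reproduce the indexed lines of the example output: the button's `role` and non-whitelisted `class` are dropped, and the new input gets its `*` prefix.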

## PageController Integration and Action Mapping

The `PageController` class orchestrates the pipeline through its `updateTree()` method in [`packages/page-controller/src/PageController.ts`](https://github.com/alibaba/page-agent/blob/main/packages/page-controller/src/PageController.ts). After generating the textual representation, it stores the result in `this.simplifiedHTML` and constructs two critical lookup tables:

- **`selectorMap`**: Maps highlight indices to actual DOM nodes via `dom.getSelectorMap()`, enabling the system to resolve LLM actions ("click element 5") to concrete elements.
- **`elementTextMap`**: Maps indices to raw text lines via `dom.getElementTextMap()`, providing context for text input operations.

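This architecture isolates the LLM from raw DOM manipulation: the model only ever sees indices and text. Apart from the `*` markers, which depend on what earlier snapshots have already seen, the same page state yields the same text, which helps keep the agent's reasoning reproducible.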
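A minimal sketch of how an index-based action might be resolved through the two lookup tables is shown below. The `Clickable` interface, the `performClick` function, and the choice of `Map` as the container are assumptions for illustration. The actual `PageController` API and the shapes returned by `dom.getSelectorMap()` and `dom.getElementTextMap()` may differ.

```typescript
// Hypothetical stand-in for a DOM element that can be clicked.
interface Clickable {
  click(): void;
}

function performClick(
  index: number,
  selectorMap: Map<number, Clickable>,
  elementTextMap: Map<number, string>,
): string {
  const target = selectorMap.get(index);
  if (!target) {
    // The LLM may reference an index that is not in the current snapshot.
    throw new Error(`No interactive element at index ${index}`);
  }
  target.click();
  // Fall back to the bare index when no text line is recorded.
  const label = elementTextMap.get(index) ?? `[${index}]`;
  return `Clicked ${label}`;
}

// Usage with a fake element that records how often it was clicked.
const state = { clicks: 0 };
const selectorMap = new Map<number, Clickable>([
  [1, { click: () => { state.clicks += 1; } }],
]);
const elementTextMap = new Map<number, string>([[1, "[1]<button>Login</>"]]);

const outcome = performClick(1, selectorMap, elementTextMap);
console.log(outcome);
```

Failing loudly on an unknown index, rather than silently ignoring it, gives the agent a message it can use to request a fresh snapshot and retry.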

## Summary

- The **DOM dehydration pipeline** in `alibaba/page-agent` converts live DOM into LLM-friendly text through three stages: flattening (`getFlatTree`), hierarchical reconstruction (`buildTreeNode`), and textual rendering (`processNode`).
- Interactive elements receive **highlight indices** that serve as numeric references for LLM actions within a snapshot, with new elements marked by `*` prefixes.
- **Attribute filtering** and **text truncation** minimize token usage while preserving semantic meaning, using whitelists and deduplication logic.
- The pipeline outputs deterministic, reproducible representations stored in `simplifiedHTML`, supported by `selectorMap` and `elementTextMap` for bidirectional LLM-to-DOM translation.

## Frequently Asked Questions

### What is the purpose of highlight indices in the DOM dehydration pipeline?

Highlight indices are sequential numeric identifiers (e.g., `[0]`, `[1]`) assigned to every interactive element during the `getFlatTree()` extraction phase. These indices allow the LLM to reference specific UI components unambiguously when issuing commands like "click element 3" or "type into element 7", bridging the gap between textual understanding and concrete DOM manipulation.

### How does the pipeline identify and mark newly appeared interactive elements?

The system maintains a `WeakMap` called `newElementsCache` that stores references to previously seen interactive nodes. During DOM extraction, if an interactive element's DOM reference is not found in this cache, the pipeline sets `isNew = true` on the node. When `processNode()` renders the tree, these nodes receive a `*` prefix (e.g., `*[2]`) to alert the LLM that these are fresh interactive opportunities.

### What attribute filtering rules does the DOM dehydration pipeline apply?

The pipeline filters element attributes against a built-in whitelist (`DEFAULT_INCLUDE_ATTRIBUTES`) combined with any caller-supplied attributes. It removes redundant entries where the `role` attribute equals the tag name, deduplicates repeating values, and truncates long strings using `capTextLength()`. This ensures that only semantically relevant, concise attribute data reaches the LLM, optimizing token usage.

### How does the LLM use the dehydration pipeline output to interact with the browser?

The LLM receives the `simplifiedHTML` string containing indexed element markers. When the LLM responds with actions referencing these indices (e.g., "click [5]"), the `PageController` uses `selectorMap` to resolve index 5 to the actual DOM node and `elementTextMap` to verify context. This bidirectional mapping allows the LLM to perceive the page through text and act upon it through the original DOM references.