How PageController Handles DOM Extraction and Element Indexing in Alibaba Page Agent

March 9, 2026 · alibaba/page-agent

PageController transforms live web pages into LLM-friendly formats by traversing the DOM to build a flat tree, assigning stable numeric indices to interactive elements, and maintaining a selector map that enables precise programmatic actions.

The PageController (packages/page-controller) is the core orchestration component in the Alibaba page-agent repository. It converts complex browser DOM structures into compact, indexed representations that AI agents can understand and manipulate through stable numeric references.

The DOM Extraction Pipeline

Triggering the Extraction Process

The extraction sequence begins when an API consumer invokes pageController.updateTree() in packages/page-controller/src/PageController.ts. This method takes a fresh DOM snapshot of the current page state, including the interactive elements that fall within the configured viewport.

Preparation and Cleanup

Before traversal begins, the controller performs two critical setup steps. First, if a visual mask is active, it is temporarily disabled to ensure unobstructed DOM inspection. Second, all previous interaction highlights are removed via dom.cleanUpHighlights() in packages/page-controller/src/dom/index.ts, ensuring a clean state for the new indexing operation.
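
The two setup steps can be sketched as a small wrapper around the extraction call. This is a minimal illustration, not the code in PageController.ts: MaskLike, withCleanPage, and the restore-in-finally behavior are assumptions made for the example, and the cleanup callback stands in for dom.cleanUpHighlights().

```typescript
// Hypothetical stand-in for the visual mask; not the real page-agent type.
interface MaskLike {
  active: boolean
}

function withCleanPage<T>(
  mask: MaskLike,
  cleanUpHighlights: () => void,
  extract: () => T,
): T {
  const wasActive = mask.active
  mask.active = false // step 1: lift the mask so it cannot obstruct inspection
  cleanUpHighlights() // step 2: drop highlights left by the previous snapshot
  try {
    return extract()
  } finally {
    mask.active = wasActive // assumed: the mask is restored even if extraction throws
  }
}

// Usage: record the order in which the steps run.
const callLog: string[] = []
const demoMask: MaskLike = { active: true }
const extracted = withCleanPage(
  demoMask,
  () => callLog.push('cleanup'),
  () => {
    callLog.push(`extract(maskActive=${demoMask.active})`)
    return 'tree'
  },
)
console.log(callLog, extracted, demoMask.active)
```

Wrapping the extraction in try/finally is one way to make "temporarily disabled" hold on error paths too; whether the real controller does this is not stated in the source.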

Building the Flat DOM Tree

Tree Generation with getFlatTree

The dom.getFlatTree(config) function in src/dom/index.ts is the entry point for DOM normalization. It returns a flat DOM tree of the nodes the library tracks, filtered by visibility, interactivity, and the viewportExpansion parameter (which defaults to VIEWPORT_EXPANSION from src/constants.ts).
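
To make the effect of viewportExpansion concrete, here is a rough sketch of a rectangle test against a viewport grown by a pixel margin on every side. The Rect and ViewportSize shapes and the function name are invented for illustration; the library's own check may differ in its details.

```typescript
interface Rect { top: number; bottom: number; left: number; right: number }
interface ViewportSize { width: number; height: number }

// True when the rectangle overlaps the viewport after growing it by `expansion` pixels.
function isRectInExpandedViewport(rect: Rect, viewport: ViewportSize, expansion: number): boolean {
  return (
    rect.bottom >= -expansion &&
    rect.top <= viewport.height + expansion &&
    rect.right >= -expansion &&
    rect.left <= viewport.width + expansion
  )
}

const viewportSize: ViewportSize = { width: 1280, height: 720 }
const belowFold: Rect = { top: 850, bottom: 900, left: 100, right: 300 }

console.log(isRectInExpandedViewport(belowFold, viewportSize, 0))   // false: 130px below the fold
console.log(isRectInExpandedViewport(belowFold, viewportSize, 200)) // true: inside the 200px margin
```

A larger margin lets the agent see elements just outside the visible area at the cost of a longer prompt.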

Deep Traversal and Caching

The heavy lifting occurs inside domTree() in packages/page-controller/src/dom/dom_tree/index.js. This implementation performs a depth-first walk of the page while maintaining a DOM_CACHE (lines 58-67) that stores bounding rectangles, client rectangles, and computed styles. Caching these values reduces layout thrashing, because each element's geometry and style are read once rather than recomputed by every check during traversal.
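
The caching idea can be shown with a memoized lookup. This sketch assumes a WeakMap keyed by element and uses a hypothetical Measurable interface in place of a real element's getBoundingClientRect(); the actual layout of DOM_CACHE is not described in this article.

```typescript
interface BoxRect { x: number; y: number; width: number; height: number }

// Stand-in for an element; measure() plays the role of getBoundingClientRect().
interface Measurable { measure(): BoxRect }

const rectCache = new WeakMap<Measurable, BoxRect>()

function getCachedRect(el: Measurable): BoxRect {
  const hit = rectCache.get(el)
  if (hit !== undefined) return hit
  const rect = el.measure() // the only place a potentially reflow-forcing read happens
  rectCache.set(el, rect)
  return rect
}

// Usage: two lookups, one measurement.
let measureCalls = 0
const fakeElement: Measurable = {
  measure() {
    measureCalls += 1
    return { x: 0, y: 0, width: 120, height: 32 }
  },
}
getCachedRect(fakeElement)
getCachedRect(fakeElement)
console.log(measureCalls) // 1
```

A WeakMap lets cached entries be garbage-collected along with their elements, which suits a cache that is rebuilt on every snapshot.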

Handling Shadow DOM and Iframes

The traversal algorithm recursively enters shadow roots via node.shadowRoot and descends into iframes by accessing node.contentDocument. This ensures that interactive elements nested within web components or framed content receive consistent indexing alongside standard DOM nodes.
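
The recursion into shadow roots and frames can be illustrated on a plain data structure. SketchNode and collectTags are hypothetical names; the sketch mirrors only the two property names mentioned above (shadowRoot and contentDocument), and it models an inaccessible frame with null, which is what browsers return for cross-origin iframes.

```typescript
interface SketchNode {
  tag: string
  children?: SketchNode[]
  shadowRoot?: { children: SketchNode[] }
  contentDocument?: { body: SketchNode } | null // null models a frame we cannot read
}

// Depth-first walk that also descends into shadow roots and readable iframes.
function collectTags(node: SketchNode, out: string[] = []): string[] {
  out.push(node.tag)
  for (const child of node.children ?? []) collectTags(child, out)
  if (node.shadowRoot) {
    for (const child of node.shadowRoot.children) collectTags(child, out)
  }
  if (node.tag === 'iframe' && node.contentDocument) {
    collectTags(node.contentDocument.body, out)
  }
  return out
}

const page: SketchNode = {
  tag: 'body',
  children: [
    { tag: 'my-widget', shadowRoot: { children: [{ tag: 'button' }] } },
    { tag: 'iframe', contentDocument: { body: { tag: 'body', children: [{ tag: 'a' }] } } },
    { tag: 'iframe', contentDocument: null },
  ],
}

console.log(collectTags(page))
// [ 'body', 'my-widget', 'button', 'iframe', 'body', 'a', 'iframe' ]
```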

Detecting Interactive Elements

Visibility and Interactivity Heuristics

During traversal, each element undergoes rigorous filtering through functions like isElementVisible, isTopElement, and isInExpandedViewport. Interactivity determination relies on isInteractiveElement (lines 94-450), which combines multiple signals: cursor styles (interactiveCursors), tag name whitelists (a, button, input), ARIA roles, attached event listeners, and user-supplied whitelist/blacklist configurations.
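
As a rough illustration of how several signals might be combined, the sketch below checks a plain descriptor object rather than a live element. The ElementSignals and InteractivityConfig shapes, the contents of the three sets, matching the whitelist and blacklist by tag name, and the precedence (blacklist first, then whitelist) are all assumptions for the example; the real isInteractiveElement is considerably longer.

```typescript
interface ElementSignals {
  tag: string
  role?: string
  cursor?: string
  hasClickListener?: boolean
}

// Hypothetical config shape: lists of tag names.
interface InteractivityConfig { whitelist?: string[]; blacklist?: string[] }

const INTERACTIVE_TAGS = new Set(['a', 'button', 'input'])
const INTERACTIVE_ROLES = new Set(['button', 'link', 'checkbox', 'tab'])
const INTERACTIVE_CURSORS = new Set(['pointer', 'text', 'grab'])

function looksInteractive(el: ElementSignals, config: InteractivityConfig = {}): boolean {
  // User-supplied lists override the built-in heuristics (assumed precedence).
  if ((config.blacklist ?? []).indexOf(el.tag) !== -1) return false
  if ((config.whitelist ?? []).indexOf(el.tag) !== -1) return true
  return (
    INTERACTIVE_TAGS.has(el.tag) ||
    (el.role !== undefined && INTERACTIVE_ROLES.has(el.role)) ||
    (el.cursor !== undefined && INTERACTIVE_CURSORS.has(el.cursor)) ||
    el.hasClickListener === true
  )
}

console.log(looksInteractive({ tag: 'button' }))                            // true (tag whitelist)
console.log(looksInteractive({ tag: 'div', cursor: 'pointer' }))            // true (cursor style)
console.log(looksInteractive({ tag: 'div' }))                               // false (no signal)
console.log(looksInteractive({ tag: 'button' }, { blacklist: ['button'] })) // false (blacklisted)
```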

Metadata Attachment

For elements passing the interactivity checks, additional metadata is attached via a WeakMap called extraData. This includes scrollability indicators and cached bounding rectangles accessed through getCachedBoundingRect and getCachedComputedStyle.

The Indexing System

Allocating Highlight Indices

The indexing mechanism centers on a module-level highlightIndex variable initialized at 0 (line 47 of dom_tree/index.js). When highlightElement identifies an interactive node that satisfies visibility and top-element requirements, it assigns the current index value to node.highlightIndex and increments the counter. This creates a stable, zero-based numeric identifier for each actionable element.
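
The allocation rule is simple enough to sketch directly. The Candidate shape and assignHighlightIndices are hypothetical, and the sketch keeps the counter local to one call rather than at module level, so it only illustrates the "assign, then increment" behavior for qualifying nodes.

```typescript
interface Candidate {
  label: string
  interactive: boolean
  visible: boolean
  top: boolean
  highlightIndex?: number
}

// Assigns zero-based indices in traversal order; returns how many were handed out.
function assignHighlightIndices(nodes: Candidate[]): number {
  let next = 0
  for (const node of nodes) {
    if (node.interactive && node.visible && node.top) {
      node.highlightIndex = next
      next += 1
    }
  }
  return next
}

const linkNode: Candidate = { label: 'link', interactive: true, visible: true, top: true }
const hiddenButton: Candidate = { label: 'hidden button', interactive: true, visible: false, top: true }
const plainDiv: Candidate = { label: 'div', interactive: false, visible: true, top: true }
const inputNode: Candidate = { label: 'input', interactive: true, visible: true, top: true }

const assignedCount = assignHighlightIndices([linkNode, hiddenButton, plainDiv, inputNode])
console.log(assignedCount, linkNode.highlightIndex, hiddenButton.highlightIndex, inputNode.highlightIndex)
// 2 0 undefined 1
```

Skipped nodes do not consume an index, so the numbers the LLM sees are contiguous.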

Creating the Selector Map

After the flat tree is complete, dom.getSelectorMap(flatTree) in src/dom/index.ts constructs a Map<number, InteractiveElementDomNode>. This map filters the tree for nodes where node.isInteractive && typeof node.highlightIndex === 'number', creating the bridge between numeric indices and actual DOM references.
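
A minimal sketch of that filter follows. It applies the same predicate quoted above, but the FlatNode and FlatTreeSketch shapes (a tree whose nodes live in a map property keyed by id) are assumptions rather than the library's actual types.

```typescript
interface FlatNode { tagName: string; isInteractive?: boolean; highlightIndex?: number }
interface FlatTreeSketch { map: Record<string, FlatNode> }

function buildSelectorMap(tree: FlatTreeSketch): Map<number, FlatNode> {
  const selectorMap = new Map<number, FlatNode>()
  for (const id of Object.keys(tree.map)) {
    const node = tree.map[id]
    // Same condition as in the text: interactive and carrying a numeric index.
    if (node && node.isInteractive && typeof node.highlightIndex === 'number') {
      selectorMap.set(node.highlightIndex, node)
    }
  }
  return selectorMap
}

const sampleTree: FlatTreeSketch = {
  map: {
    n1: { tagName: 'div' },
    n2: { tagName: 'button', isInteractive: true, highlightIndex: 0 },
    n3: { tagName: 'a', isInteractive: true, highlightIndex: 1 },
    n4: { tagName: 'span', isInteractive: true }, // interactive but never indexed
  },
}

const selectorMapSketch = buildSelectorMap(sampleTree)
console.log(selectorMapSketch.size)            // 2
console.log(selectorMapSketch.get(1)?.tagName) // a
```

The typeof check matters because index 0 is falsy; a plain truthiness test would drop the first element.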

Generating Simplified HTML

The dom.flatTreeToString(flatTree, includeAttributes) function converts the structured tree into indented text format where interactive nodes appear as [<index>]<tag attributes>content. The dom.getElementTextMap(simplifiedHTML) function then parses these lines using the regex /^\[(\d+)\]<[^>]+>([^<]*)/ to build a Map<number, string> mapping indices to human-readable descriptions.
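
The parsing step can be tried in isolation. The sketch below uses the regex quoted above on a made-up sample; trimming each line before matching is an assumption (needed here because indented lines would otherwise fail the ^ anchor), and parseElementTextMap is a hypothetical name.

```typescript
function parseElementTextMap(simplifiedHTML: string): Map<number, string> {
  const pattern = /^\[(\d+)\]<[^>]+>([^<]*)/
  const textMap = new Map<number, string>()
  for (const rawLine of simplifiedHTML.split('\n')) {
    const match = pattern.exec(rawLine.trim())
    if (match) {
      // Group 1 is the index, group 2 the text that follows the opening tag.
      textMap.set(Number(match[1]), (match[2] ?? '').trim())
    }
  }
  return textMap
}

const sampleHTML = [
  'Checkout',
  '[0]<a href=/cart>View cart',
  '\t[1]<button type=submit>Place order',
].join('\n')

const textMapSketch = parseElementTextMap(sampleHTML)
console.log(textMapSketch.get(0)) // View cart
console.log(textMapSketch.get(1)) // Place order
console.log(textMapSketch.size)   // 2
```

Lines without a leading [index] marker, such as plain text nodes, are ignored.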

Marking New Elements

The system tracks element persistence through the node.isNew flag. After building the tree, getFlatTree iterates over the entries of elements.map and flags nodes whose underlying DOM references did not appear in previous snapshots, which helps LLMs identify dynamically added content.
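
One plausible way to implement this kind of tracking is a WeakSet of references seen so far. TrackedNode, seenRefs, and markNewNodes are hypothetical names, and in this sketch every node counts as new on the very first snapshot, which may not match the library's behavior.

```typescript
interface TrackedNode { ref: object; isNew?: boolean }

// References observed in earlier snapshots; weak so detached elements can be collected.
const seenRefs = new WeakSet<object>()

function markNewNodes(nodes: TrackedNode[]): void {
  for (const node of nodes) {
    node.isNew = !seenRefs.has(node.ref)
    seenRefs.add(node.ref)
  }
}

// Two snapshots of the same page: refA survives, refC is added later.
const refA = {}
const refB = {}
const refC = {}

markNewNodes([{ ref: refA }, { ref: refB }])

const survivor: TrackedNode = { ref: refA }
const newcomer: TrackedNode = { ref: refC }
markNewNodes([survivor, newcomer])

console.log(survivor.isNew, newcomer.isNew) // false true
```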

Practical Implementation

Capturing an Indexed Snapshot

To extract the current page state with element indices:

import { PageController } from '@page-agent/page-controller'

async function snapshot() {
  const controller = new PageController({ 
    enableMask: true, 
    viewportExpansion: 200 
  })
  
  // Runs full extraction: cleaning → traversal → indexing → mapping
  const simplifiedHTML = await controller.updateTree()
  const state = await controller.getBrowserState()
  
  console.log(state.header)   // Page title + scroll position
  console.log(state.content)  // Simplified HTML with [0], [1], [2]...
}

snapshot()

Executing Actions by Index

Once indexed, elements are addressable through the selector map:

import { PageController } from '@page-agent/page-controller'

async function act() {
  const pc = new PageController()
  await pc.updateTree()  // Ensure tree is indexed: this.isIndexed = true
  
  // Looks up element in selectorMap via getElementByIndex()
  const result = await pc.clickElement(5)
  console.log(result.message) // "✅ Clicked element (Submit)."
  
  // Scroll to specific indexed element
  await pc.scroll({ down: true, numPages: 1, index: 7 })
}

act()

The clickElement method in PageController.ts (line 44) uses getElementByIndex(this.selectorMap, index) to resolve the numeric reference to a DOM node before delegating to actions.clickElement in src/actions.ts.
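
The lookup step might look roughly like the following. IndexedNode, its ref field, resolveByIndex, and the error message are all hypothetical; the sketch only shows the general pattern of failing loudly when the LLM supplies an index that is not in the current selector map.

```typescript
interface IndexedNode { ref?: { click(): void } }

function resolveByIndex(selectorMap: Map<number, IndexedNode>, index: number): { click(): void } {
  const node = selectorMap.get(index)
  if (!node || !node.ref) {
    throw new Error(`No interactive element found at index ${index}`)
  }
  return node.ref
}

// Usage: index 5 resolves, index 9 does not exist in this snapshot.
let clicks = 0
const demoSelectorMap = new Map<number, IndexedNode>()
demoSelectorMap.set(5, { ref: { click: () => { clicks += 1 } } })

resolveByIndex(demoSelectorMap, 5).click()

let lookupError = ''
try {
  resolveByIndex(demoSelectorMap, 9)
} catch (err) {
  lookupError = err instanceof Error ? err.message : String(err)
}
console.log(clicks, lookupError)
```

Returning a descriptive error rather than silently doing nothing gives the agent something to react to, for example by refreshing the tree and choosing again.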

Summary

Orchestration: PageController.updateTree() in src/PageController.ts coordinates the entire extraction pipeline from cleanup through indexing completion.
Normalization: getFlatTree() and domTree() flatten complex DOM structures across shadow boundaries and iframe contexts into a uniform traversable format.
Optimization: The DOM_CACHE mechanism (lines 58-67) reduces layout thrashing by caching computed styles and bounding rectangles during the single traversal pass.
Addressability: The selector map creates a Map<number, InteractiveElementDomNode> bridge, enabling reliable programmatic interaction through numeric indices that remain valid for the snapshot that produced them.
LLM Formatting: flatTreeToString generates the bracketed index format [0]<button> that large language models parse to understand available actions.

Frequently Asked Questions

How does PageController handle Shadow DOM and iframes during extraction?

The domTree() traversal in src/dom/dom_tree/index.js detects shadow roots and recursively processes node.shadowRoot, while iframe handling accesses node.contentDocument and runs the same tree builder inside the frame context. This ensures interactive elements within web components or embedded documents receive sequential indices alongside standard DOM nodes.

What criteria determine if an element receives an index?

Elements must satisfy visibility checks (isElementVisible, isTopElement) and interactivity heuristics in isInteractiveElement (lines 94-450). The system evaluates cursor styles, tag name whitelists, ARIA roles, event listeners, and user-supplied filter lists. Only elements passing both visibility and interactivity tests receive a highlightIndex via the highlightElement function.

How does the selector map maintain stable references to indexed elements?

The getSelectorMap() function in src/dom/index.ts creates a Map<number, InteractiveElementDomNode> where keys are the numeric highlightIndex values and values are direct DOM node references. PageController stores this as this.selectorMap, allowing action methods like clickElement() to perform O(1) lookups of DOM nodes using the LLM-provided numeric indices.

Can developers customize which elements get indexed?

Yes. The getFlatTree(config) function accepts configuration parameters including viewportExpansion (margin around visible area) and whitelist/blacklist arrays. These filters combine with the built-in interactivity heuristics in isInteractiveElement to control which nodes receive highlight indices and appear in the simplified HTML output.
