How PageController Handles DOM Extraction and Element Indexing in Alibaba Page Agent
PageController transforms live web pages into LLM-friendly formats by traversing the DOM to build a flat tree, assigning stable numeric indices to interactive elements, and maintaining a selector map that enables precise programmatic actions.
The PageController (packages/page-controller) is the core orchestration component in the Alibaba page-agent repository. It converts complex browser DOM structures into compact, indexed representations that AI agents can understand and manipulate through stable numeric references.
The DOM Extraction Pipeline
Triggering the Extraction Process
The extraction sequence begins when an API consumer invokes pageController.updateTree() in packages/page-controller/src/PageController.ts. This method initiates a comprehensive DOM snapshot that captures the current state of the page, including all interactive elements within the configured viewport.
Preparation and Cleanup
Before traversal begins, the controller performs two critical setup steps. First, if a visual mask is active, it is temporarily disabled to ensure unobstructed DOM inspection. Second, all previous interaction highlights are removed via dom.cleanUpHighlights() in packages/page-controller/src/dom/index.ts, ensuring a clean state for the new indexing operation.
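The ordering of these setup steps can be sketched as follows. This is a minimal illustration only: `ExtractionHooks`, `setMaskEnabled`, and `buildTree` are hypothetical names invented for the example, not part of the library's API, and restoring the mask afterwards is an assumption drawn from the word "temporarily".

```typescript
// Hypothetical hooks standing in for the controller's internals.
interface ExtractionHooks {
  maskActive: boolean
  setMaskEnabled(enabled: boolean): void
  cleanUpHighlights(): void
  buildTree(): string
}

function extractWithCleanState(hooks: ExtractionHooks): string {
  const hadMask = hooks.maskActive
  if (hadMask) hooks.setMaskEnabled(false) // step 1: move the mask out of the way
  try {
    hooks.cleanUpHighlights() // step 2: drop highlights from the previous run
    return hooks.buildTree() // traversal now sees an unobstructed page
  } finally {
    if (hadMask) hooks.setMaskEnabled(true) // assumed: restore even if traversal throws
  }
}

// Usage with a fake page that records the call order.
const callLog: string[] = []
const snapshotText = extractWithCleanState({
  maskActive: true,
  setMaskEnabled: (on) => callLog.push(on ? 'mask:on' : 'mask:off'),
  cleanUpHighlights: () => callLog.push('cleanup'),
  buildTree: () => {
    callLog.push('traverse')
    return '[0]<button>OK'
  },
})
```

The `try`/`finally` shape is a design choice for the sketch: it keeps the mask state consistent when traversal fails partway through.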
Building the Flat DOM Tree
Tree Generation with getFlatTree
The dom.getFlatTree(config) function in src/dom/index.ts serves as the entry point for DOM normalization. This function returns a flat DOM tree containing every node the library tracks, filtering based on visibility, interactivity, and the viewportExpansion parameter (which defaults to VIEWPORT_EXPANSION from src/constants.ts).
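To make the role of viewportExpansion concrete, here is a rough sketch of a viewport check with a pixel margin. It assumes the margin is applied symmetrically above and below the visible area and only considers the vertical axis; `isWithinExpandedViewport` is an illustrative name, and the library's own isInExpandedViewport may behave differently.

```typescript
// Only the vertical extent matters for this simplified check.
interface VerticalExtent {
  top: number
  bottom: number
}

// True when the rectangle overlaps the viewport grown by `expansion` pixels
// above and below (an assumed interpretation of viewportExpansion).
function isWithinExpandedViewport(
  rect: VerticalExtent,
  viewportHeight: number,
  expansion: number,
): boolean {
  return rect.bottom >= -expansion && rect.top <= viewportHeight + expansion
}

const viewportHeightPx = 800
const justBelowFold = isWithinExpandedViewport({ top: 900, bottom: 940 }, viewportHeightPx, 200)
const farBelowFold = isWithinExpandedViewport({ top: 1100, bottom: 1140 }, viewportHeightPx, 200)
```

With an 800px viewport and a 200px margin, an element starting at 900px is kept while one starting at 1100px is filtered out.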
Deep Traversal and Caching
The heavy lifting occurs inside domTree() in packages/page-controller/src/dom/dom_tree/index.js. This implementation performs a depth-first walk of the page while maintaining a DOM_CACHE (lines 58-67) to store bounding rectangles, client rectangles, and computed styles. This caching strategy prevents layout thrashing by avoiding repeated reflow calculations during traversal.
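The caching idea can be illustrated with a small sketch. It assumes a WeakMap keyed by element, which is one common way to build such a cache; the actual layout of DOM_CACHE is not shown in this article, and `getCachedBoundingRectSketch` is a made-up name.

```typescript
// Structural types so the sketch runs outside a browser.
interface RectLike {
  x: number
  y: number
  width: number
  height: number
}
interface Measurable {
  getBoundingClientRect(): RectLike
}

// Assumed cache shape: one rectangle per element, released with the element.
const boundingRectCache = new WeakMap<Measurable, RectLike>()

function getCachedBoundingRectSketch(el: Measurable): RectLike {
  const cached = boundingRectCache.get(el)
  if (cached) return cached
  const rect = el.getBoundingClientRect() // the only call that can force layout
  boundingRectCache.set(el, rect)
  return rect
}

// A fake element that counts how often it is measured.
let measureCalls = 0
const fakeElement: Measurable = {
  getBoundingClientRect() {
    measureCalls += 1
    return { x: 0, y: 0, width: 120, height: 32 }
  },
}
const firstRead = getCachedBoundingRectSketch(fakeElement)
const secondRead = getCachedBoundingRectSketch(fakeElement)
```

The second read is served from the cache, so the element is measured only once during the pass.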
Handling Shadow DOM and Iframes
The traversal algorithm recursively enters shadow roots via node.shadowRoot and descends into iframes by accessing node.contentDocument (around line 20,400). This ensures that interactive elements nested within web components or framed content receive consistent indexing alongside standard DOM nodes.
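A simplified picture of this recursion is sketched below. `WalkNode` is a hypothetical stand-in for real DOM nodes, and the visiting order (light DOM children, then shadow root, then frame document) is an assumption made for the example. One fact worth keeping in mind: contentDocument is null for cross-origin frames, so a real traversal has to skip those.

```typescript
// Minimal stand-in for DOM nodes; not the library's node type.
interface WalkNode {
  tag: string
  children?: WalkNode[]
  shadowRoot?: { children: WalkNode[] } | null
  contentDocument?: { body: WalkNode } | null
}

// Depth-first walk that also enters shadow roots and same-origin iframes.
function collectTags(node: WalkNode, out: string[] = []): string[] {
  out.push(node.tag)
  for (const child of node.children ?? []) collectTags(child, out)
  if (node.shadowRoot) {
    for (const child of node.shadowRoot.children) collectTags(child, out)
  }
  // Null for cross-origin frames, in which case the frame body is skipped.
  if (node.contentDocument) collectTags(node.contentDocument.body, out)
  return out
}

const samplePage: WalkNode = {
  tag: 'body',
  children: [
    { tag: 'my-widget', shadowRoot: { children: [{ tag: 'button' }] } },
    { tag: 'iframe', contentDocument: { body: { tag: 'body', children: [{ tag: 'a' }] } } },
    { tag: 'iframe', contentDocument: null },
  ],
}
const visitedTags = collectTags(samplePage)
```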
Detecting Interactive Elements
Visibility and Interactivity Heuristics
During traversal, each element is filtered through functions such as isElementVisible, isTopElement, and isInExpandedViewport. Interactivity is decided by isInteractiveElement (lines 94-450), which combines several signals: cursor styles (interactiveCursors), tag name whitelists (a, button, input), ARIA roles, attached event listeners, and user-supplied whitelist/blacklist configurations.
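The way such signals might combine is sketched below. Everything beyond the tag names a, button, and input is an assumption: the role and cursor lists, the shape of the whitelist and blacklist (sets of element descriptors here), and the precedence of blacklist over whitelist are illustrative choices, not the behavior of isInteractiveElement.

```typescript
// Hypothetical, pre-extracted facts about an element.
interface CandidateInfo {
  tagName: string
  role?: string
  cursor?: string
  hasClickListener?: boolean
}
interface FilterLists {
  whitelist?: Set<CandidateInfo>
  blacklist?: Set<CandidateInfo>
}

const INTERACTIVE_TAGS = new Set(['a', 'button', 'input']) // tags named in the article
const INTERACTIVE_ROLES = new Set(['button', 'link']) // assumed examples
const INTERACTIVE_CURSORS = new Set(['pointer']) // assumed example

function looksInteractive(el: CandidateInfo, lists: FilterLists = {}): boolean {
  if (lists.blacklist?.has(el)) return false // assumed: blacklist wins
  if (lists.whitelist?.has(el)) return true
  if (INTERACTIVE_TAGS.has(el.tagName.toLowerCase())) return true
  if (el.role !== undefined && INTERACTIVE_ROLES.has(el.role)) return true
  if (el.cursor !== undefined && INTERACTIVE_CURSORS.has(el.cursor)) return true
  return el.hasClickListener === true
}

const plainDiv: CandidateInfo = { tagName: 'DIV' }
const clickableDiv: CandidateInfo = { tagName: 'DIV', cursor: 'pointer' }
const blockedLink: CandidateInfo = { tagName: 'A' }
```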
Metadata Attachment
For elements passing the interactivity checks, additional metadata is attached via a WeakMap called extraData. This includes scrollability indicators and cached bounding rectangles accessed through getCachedBoundingRect and getCachedComputedStyle.
The Indexing System
Allocating Highlight Indices
The indexing mechanism centers on a module-level highlightIndex variable initialized at 0 (line 47 of dom_tree/index.js). When highlightElement identifies an interactive node that satisfies visibility and top-element requirements, it assigns the current index value to node.highlightIndex and increments the counter. This creates a stable, zero-based numeric identifier for each actionable element.
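The allocation step described above can be pictured with this small sketch. The counter and the node shape are simplified stand-ins (`highlightCounter`, `IndexableNode`, and `assignIndexIfEligible` are invented names), so treat it as an illustration of the pattern, not the code in dom_tree/index.js.

```typescript
interface IndexableNode {
  tag: string
  isInteractive: boolean
  isVisible: boolean
  isTopElement: boolean
  highlightIndex?: number
}

// Module-level counter, starting at 0 as described in the article.
let highlightCounter = 0

// Assigns the next index only to nodes that pass all three checks.
function assignIndexIfEligible(node: IndexableNode): boolean {
  if (!(node.isInteractive && node.isVisible && node.isTopElement)) return false
  node.highlightIndex = highlightCounter
  highlightCounter += 1
  return true
}

const candidates: IndexableNode[] = [
  { tag: 'button', isInteractive: true, isVisible: true, isTopElement: true },
  { tag: 'a', isInteractive: true, isVisible: false, isTopElement: true },
  { tag: 'input', isInteractive: true, isVisible: true, isTopElement: true },
]
candidates.forEach((node) => assignIndexIfEligible(node))
const assignedIndices = candidates.map((node) => node.highlightIndex)
```

The hidden link is skipped without consuming an index, so the visible button and input receive 0 and 1.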
Creating the Selector Map
After the flat tree is complete, dom.getSelectorMap(flatTree) in src/dom/index.ts constructs a Map<number, InteractiveElementDomNode>. This map filters the tree for nodes where node.isInteractive && typeof node.highlightIndex === 'number', creating the bridge between numeric indices and actual DOM references.
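A minimal sketch of that filter follows. The flat tree is assumed to expose its nodes as a keyed record named `map`; that shape, along with the `FlatNodeSketch` and `buildSelectorMap` names, is hypothetical. Only the filter condition itself is taken from the description above.

```typescript
// Assumed node and tree shapes for illustration.
interface FlatNodeSketch {
  tagName?: string
  isInteractive?: boolean
  highlightIndex?: number
}
interface FlatTreeSketch {
  map: Record<string, FlatNodeSketch>
}

function buildSelectorMap(tree: FlatTreeSketch): Map<number, FlatNodeSketch> {
  const selectorMap = new Map<number, FlatNodeSketch>()
  for (const id of Object.keys(tree.map)) {
    const node = tree.map[id]
    // Same condition the article quotes for getSelectorMap.
    if (node && node.isInteractive && typeof node.highlightIndex === 'number') {
      selectorMap.set(node.highlightIndex, node)
    }
  }
  return selectorMap
}

const sampleTree: FlatTreeSketch = {
  map: {
    n1: { tagName: 'div' },
    n2: { tagName: 'button', isInteractive: true, highlightIndex: 0 },
    n3: { tagName: 'a', isInteractive: true, highlightIndex: 1 },
    n4: { tagName: 'span', isInteractive: true }, // interactive but never indexed
  },
}
const sampleSelectorMap = buildSelectorMap(sampleTree)
```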
Generating Simplified HTML
The dom.flatTreeToString(flatTree, includeAttributes) function converts the structured tree into indented text format where interactive nodes appear as [<index>]<tag attributes>content. The dom.getElementTextMap(simplifiedHTML) function then parses these lines using the regex /^\[(\d+)\]<[^>]+>([^<]*)/ to build a Map<number, string> mapping indices to human-readable descriptions.
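The parsing step can be tried in isolation with the regex quoted above. The sketch assumes that each line is trimmed before matching (the output is described as indented, and the pattern is anchored at the start of the line) and that the captured text is trimmed as well; `parseElementText` is an illustrative name.

```typescript
// Regex as quoted in the article: index, opening tag, then text up to the next '<'.
const INDEXED_LINE = /^\[(\d+)\]<[^>]+>([^<]*)/

function parseElementText(simplifiedHTML: string): Map<number, string> {
  const textByIndex = new Map<number, string>()
  for (const rawLine of simplifiedHTML.split('\n')) {
    const match = rawLine.trim().match(INDEXED_LINE) // assumed: indentation is stripped first
    if (!match) continue // non-indexed lines carry no index to map
    textByIndex.set(Number(match[1]), (match[2] ?? '').trim())
  }
  return textByIndex
}

const sampleOutput = [
  '[0]<button type=submit>Submit',
  '\t[1]<a href=/docs>Docs',
  'Plain heading text',
].join('\n')
const sampleTextMap = parseElementText(sampleOutput)
```

Here index 0 maps to "Submit" and index 1 to "Docs", while the unindexed text line is ignored.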
Marking New Elements
The system tracks element persistence through the node.isNew flag. After building the tree, getFlatTree iterates over elements.map to flag nodes whose underlying DOM references have not appeared in previous snapshots, helping LLMs identify dynamic content changes.
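One way to implement this kind of tracking is a WeakSet of previously seen DOM references, sketched below. The storage choice and the names `seenRefs`, `TrackedNode`, and `markNewNodes` are assumptions for illustration. Note that in this version every node counts as new on the very first snapshot, which may differ from how the library treats its initial run.

```typescript
// References remembered from earlier snapshots (assumed storage).
const seenRefs = new WeakSet<object>()

interface TrackedNode {
  ref: object // stands in for the underlying DOM element
  isNew?: boolean
}

// Flags nodes whose reference was not present in any earlier call.
function markNewNodes(nodes: TrackedNode[]): void {
  for (const node of nodes) {
    node.isNew = !seenRefs.has(node.ref)
    seenRefs.add(node.ref)
  }
}

const saveButton = {}
const lateDialog = {}

const firstSnapshot: TrackedNode[] = [{ ref: saveButton }]
markNewNodes(firstSnapshot)

const secondSnapshot: TrackedNode[] = [{ ref: saveButton }, { ref: lateDialog }]
markNewNodes(secondSnapshot)
const secondFlags = secondSnapshot.map((node) => node.isNew)
```

In the second snapshot only the dialog that appeared later is flagged as new.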
Practical Implementation
Capturing an Indexed Snapshot
To extract the current page state with element indices:
```typescript
import { PageController } from '@page-agent/page-controller'

async function snapshot() {
  const controller = new PageController({
    enableMask: true,
    viewportExpansion: 200,
  })

  // Runs full extraction: cleaning → traversal → indexing → mapping
  const simplifiedHTML = await controller.updateTree()

  const state = await controller.getBrowserState()
  console.log(state.header) // Page title + scroll position
  console.log(state.content) // Simplified HTML with [0], [1], [2]...
}

snapshot()
```
Executing Actions by Index
Once indexed, elements are addressable through the selector map:
```typescript
import { PageController } from '@page-agent/page-controller'

async function act() {
  const pc = new PageController()
  await pc.updateTree() // Ensure tree is indexed: this.isIndexed = true

  // Looks up element in selectorMap via getElementByIndex()
  const result = await pc.clickElement(5)
  console.log(result.message) // "✅ Clicked element (Submit)."

  // Scroll to specific indexed element
  await pc.scroll({ down: true, numPages: 1, index: 7 })
}

act()
```
The clickElement method in PageController.ts (line 44) uses getElementByIndex(this.selectorMap, index) to resolve the numeric reference to a DOM node before delegating to actions.clickElement in src/actions.ts.
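The lookup step might look roughly like the sketch below. It assumes each map entry carries its DOM node in a `ref` field and that a missing index raises an error; both points, and the name `resolveByIndex`, are assumptions, since the body of getElementByIndex is not shown here.

```typescript
// Assumed entry shape: the flat-tree node keeps a reference to its element.
interface IndexedEntry {
  ref?: object | null
}

function resolveByIndex<T extends IndexedEntry>(
  selectorMap: Map<number, T>,
  index: number,
): object {
  const entry = selectorMap.get(index)
  if (!entry) throw new Error(`No interactive element found at index ${index}`)
  if (!entry.ref) throw new Error(`Element at index ${index} has no DOM reference`)
  return entry.ref
}

const submitElement = { label: 'Submit' } // stands in for a real DOM node
const demoMap = new Map<number, IndexedEntry>([[5, { ref: submitElement }]])

const resolved = resolveByIndex(demoMap, 5)

let staleIndexError = ''
try {
  resolveByIndex(demoMap, 9) // an index the current snapshot never assigned
} catch (err) {
  staleIndexError = err instanceof Error ? err.message : String(err)
}
```

Failing loudly on an unknown index gives the calling agent a clear signal to refresh the tree instead of acting on the wrong element.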
Summary
- Orchestration: PageController.updateTree() in src/PageController.ts coordinates the entire extraction pipeline from cleanup through indexing completion.
- Normalization: getFlatTree() and domTree() flatten complex DOM structures across shadow boundaries and iframe contexts into a uniform traversable format.
- Optimization: The DOM_CACHE mechanism (lines 58-67) eliminates layout thrashing by caching computed styles and bounding rectangles during the single traversal pass.
- Addressability: The selector map creates a stable Map<number, InteractiveElementDomNode> bridge, enabling reliable programmatic interaction through numeric indices that persist across extraction cycles.
- LLM Formatting: flatTreeToString generates the bracketed index format [0]<button> that large language models parse to understand available actions.
Frequently Asked Questions
How does PageController handle Shadow DOM and iframes during extraction?
The domTree() traversal in src/dom/dom_tree/index.js detects shadow roots and recursively processes node.shadowRoot, while iframe handling accesses node.contentDocument and runs the same tree builder inside the frame context (around line 20,400). This ensures interactive elements within web components or embedded documents receive sequential indices alongside standard DOM nodes.
What criteria determine if an element receives an index?
Elements must satisfy visibility checks (isElementVisible, isTopElement) and interactivity heuristics in isInteractiveElement (lines 94-450). The system evaluates cursor styles, tag name whitelists, ARIA roles, event listeners, and user-supplied filter lists. Only elements passing both visibility and interactivity tests receive a highlightIndex via the highlightElement function.
How does the selector map maintain stable references to indexed elements?
The getSelectorMap() function in src/dom/index.ts creates a Map<number, InteractiveElementDomNode> where keys are the numeric highlightIndex values and values are the interactive flat-tree nodes that carry them. PageController stores this as this.selectorMap, allowing action methods like clickElement() to perform O(1) lookups with the LLM-provided numeric indices and resolve them to DOM nodes through getElementByIndex.
Can developers customize which elements get indexed?
Yes. The getFlatTree(config) function accepts configuration parameters including viewportExpansion (margin around visible area) and whitelist/blacklist arrays. These filters combine with the built-in interactivity heuristics in isInteractiveElement to control which nodes receive highlight indices and appear in the simplified HTML output.