How to Implement a `drag_element` Tool for Complex UI Interactions in Page Agent

To implement a drag_element tool in the alibaba/page-agent repository, create a dragElement action in packages/page-controller/src/actions.ts that synthesizes the full mouse event sequence, then expose it as a tool in packages/core/src/tools/index.ts following the existing Zod schema pattern.

The alibaba/page-agent framework separates UI automation into a Core package for tool definitions and a Page-Controller package for DOM execution. When you need to implement a drag_element tool for complex interactions like drag-and-drop lists or sliders, you extend both layers while maintaining the architecture's clean separation between agent logic and browser manipulation.

Architecture Overview: Core vs. Page-Controller

Page Agent uses a two-tier architecture where tools define the contract and actions perform the work. The Core package (packages/core/src/tools/index.ts) maintains a Map<string, PageAgentTool> called tools (defined at line 28) that registers all available capabilities. The Page-Controller package (packages/page-controller/src/actions.ts) contains the actual DOM manipulation logic. This separation ensures that UI-specific code remains isolated from the headless-ready Core, making the drag_element implementation reusable across environments.

Step 1: Implement the Drag Action in actions.ts

Add the low-level dragElement function to packages/page-controller/src/actions.ts. This function reuses the existing movePointerToElement helper (lines 15-22) and waitFor utility (lines 9-11) to simulate realistic pointer movement.

export async function dragElement(
  source: HTMLElement,
  target: HTMLElement,
  steps = 10
): Promise<string> {
  // Position cursor over source and initiate drag
  await movePointerToElement(source);
  window.dispatchEvent(new CustomEvent('PageAgent::ClickPointer'));
  source.dispatchEvent(new MouseEvent('mousedown', { bubbles: true, cancelable: true }));

  // Calculate path between element centers
  const src = source.getBoundingClientRect();
  const tgt = target.getBoundingClientRect();
  const startX = src.left + src.width / 2;
  const startY = src.top + src.height / 2;
  const endX = tgt.left + tgt.width / 2;
  const endY = tgt.top + tgt.height / 2;

  // Synthesize intermediate mousemove events
  for (let i = 1; i <= steps; i++) {
    const x = startX + ((endX - startX) * i) / steps;
    const y = startY + ((endY - startY) * i) / steps;
    window.dispatchEvent(new CustomEvent('PageAgent::MovePointerTo', { detail: { x, y } }));
    await waitFor(0.02);
  }

  // Release over target
  target.dispatchEvent(new MouseEvent('mouseup', { bubbles: true, cancelable: true }));
  target.dispatchEvent(new MouseEvent('click', { bubbles: true, cancelable: true }));
  await waitFor(0.1);
  
  return `✅ Dragged element ${source.tagName} → ${target.tagName}`;
}

This implementation mirrors the event sequence found in clickElement (lines 61-71), extending it with intermediate mousemove events to satisfy browser drag-and-drop APIs.

Step 2: Register the Tool in Core

Expose the action as a tool in packages/core/src/tools/index.ts using the pattern established by click_element_by_index (lines 84-94). The registration uses a Zod schema for parameter validation and the getElementByIndex helper (lines 28-46) to resolve element references.

tools.set(
  'drag_element',
  tool({
    description:
      'Drag an interactive element (source) onto another element (target). Useful for sliders, drag-and-drop lists, and canvas manipulations.',
    inputSchema: z.object({
      sourceIndex: z.number().int().min(0),
      targetIndex: z.number().int().min(0),
      steps: z.number().int().min(1).max(20).optional().default(10),
    }),
    execute: async function (this: PageAgentCore, input) {
      const source = getElementByIndex(this.pageController.selectorMap, input.sourceIndex);
      const target = getElementByIndex(this.pageController.selectorMap, input.targetIndex);
      const result = await this.pageController.dragElement(source, target, input.steps);
      return result;
    },
  })
);

The execute method receives the PageAgentCore instance as this, providing direct access to this.pageController without additional wiring.

Step 3: Expose the Method in PageController

Ensure the PageController class exposes a public method that forwards to the action. Add a thin wrapper in packages/page-controller/src/PageController.ts:

public async dragElement(source: HTMLElement, target: HTMLElement, steps?: number): Promise<string> {
  return actions.dragElement(source, target, steps);
}

This maintains consistency with existing methods like clickElement and inputTextElement.

Event Simulation and Browser Compatibility

The drag implementation relies on synthesized events rather than native Drag-and-Drop APIs. This approach works across complex web applications because:

  • Event bubbling: All mouse events use { bubbles: true, cancelable: true } to ensure framework event listeners (React, Vue, Angular) detect the interaction.
  • Pointer visualization: CustomEvents (PageAgent::MovePointerTo, PageAgent::ClickPointer) decouple the virtual cursor display from DOM state changes.
  • Timing realism: The 20ms delay between mousemove events mimics human pointer velocity, preventing anti-automation detection.

For absolute coordinate-based dragging (independent of specific elements), modify the tool schema to accept startX, startY, endX, endY parameters and bypass getElementByIndex.

Summary

  • Create the action: Add dragElement to packages/page-controller/src/actions.ts using movePointerToElement, mousedown, interpolated mousemove events, and mouseup.
  • Register the tool: Insert a new entry into the tools Map in packages/core/src/tools/index.ts with Zod validation for sourceIndex, targetIndex, and optional steps.
  • Maintain separation: Keep DOM logic in Page-Controller and tool definitions in Core to preserve the architectural boundary between UI-heavy and headless execution contexts.
  • Verify with tests: Run npm run lint && npm test to ensure type safety and prevent regression in existing tools like scroll and click_element_by_index.

Frequently Asked Questions

Where does the actual DOM manipulation for drag operations live?

The implementation belongs in packages/page-controller/src/actions.ts. This file contains all low-level browser automation including clickElement, inputTextElement, and the new dragElement function. Keeping DOM logic centralized here ensures the Core package remains UI-agnostic.

How does the tool calculate movement coordinates between elements?

The dragElement action uses source.getBoundingClientRect() and target.getBoundingClientRect() to determine the center points of both elements. It then linearly interpolates between these coordinates based on the steps parameter, generating intermediate positions for each mousemove event.

Can I modify the tool to use absolute screen coordinates instead of element indexes?

Yes. Update the Zod schema in tools/index.ts to accept startX, startY, endX, and endY as optional alternatives to sourceIndex and targetIndex. In the execute method, conditionally bypass getElementByIndex and pass the coordinates directly to the Page-Controller action.

Why does the implementation synthesize mouse events instead of using the HTML5 Drag and Drop API?

Synthetic mouse events (mousedown, mousemove, mouseup) provide broader compatibility with custom drag implementations (sliders, sortable lists, canvas drawing) that may not use the native Drag and Drop API. This approach also matches the existing clickElement pattern (lines 61-71) for consistent event simulation across the codebase.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:
curl -s "https://instagit.com/install.md"

Works with
Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →