Building Real-Time Web Scraping Node.js Applications: Efficient Libraries for Fetching and Parsing HTML

For real-time web scraping in Node.js, use Undici for high-performance HTTP fetching, backed by a WebAssembly build of the llhttp parser and per-origin connection pooling, and Cheerio for fast, memory-efficient HTML parsing with a jQuery-style API.

Building real-time web scrapers in Node.js requires libraries that minimize latency and memory overhead while handling concurrent connections. The nodejs/node repository provides core infrastructure such as the URL implementation in lib/internal/url.js and the HTTP/2 implementation under lib/internal/http2/, and it vendors Undici in deps/undici. HTML parsers such as Cheerio are separate npm packages. This guide examines efficient libraries for fetching and parsing HTML content.

High-Performance HTTP Fetching with Undici

When scraping at scale, the HTTP client must handle many concurrent connections, TLS handshakes, and streaming data with minimal latency. Undici, the modern HTTP client that backs Node.js’s built-in fetch, typically delivers higher throughput than the legacy http module. The copy vendored in deps/undici parses HTTP/1.1 responses with llhttp compiled to WebAssembly and maintains a connection pool per origin. HTTP/2 support, enabled through the allowH2 option, builds on Node’s http2 module.

For real-time scraping, Undici’s request method returns the response body as a stream. The example below collects the chunks into a single buffer before parsing, which is fine for small-to-medium pages:

import { request } from 'undici';
import * as cheerio from 'cheerio'; // Cheerio 1.x has no default export

async function scrapeHeadlines(url) {
  // Undici automatically reuses connections via internal pooling
  const { statusCode, body } = await request(url, {
    method: 'GET',
    headers: { 'accept': 'text/html' },
  });

  if (statusCode !== 200) {
    throw new Error(`Unexpected status ${statusCode}`);
  }

  // Collect HTML for parsing (suitable for small-to-medium pages)
  const chunks = [];
  for await (const chunk of body) {
    chunks.push(chunk);
  }
  const html = Buffer.concat(chunks).toString('utf8');

  // Parse with Cheerio for jQuery-style selectors
  const $ = cheerio.load(html);
  const headlines = [];
  $('h1, h2, h3').each((_, el) => {
    headlines.push($(el).text().trim());
  });

  return headlines;
}
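Connection pooling helps most when the scraper itself bounds how many requests are in flight. The following is a minimal sketch of that idea, not part of Undici: `mapWithConcurrency` is a hypothetical helper that runs an async `worker` over a list of items with at most `limit` calls active at once.

```javascript
// Hypothetical helper (not an Undici API): run `worker` over `items`
// with at most `limit` calls in flight, preserving result order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner() {
    while (next < items.length) {
      // Claiming the index is synchronous, so two runners never share one.
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` runners; each pulls items until none remain.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

For example, `await mapWithConcurrency(urls, 8, (url) => scrapeHeadlines(url))` would scrape a list of URLs eight at a time. In this sketch a single rejected `worker` call rejects the whole batch, so wrap the worker in a try/catch if partial results are acceptable.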


Efficient HTML Parsing Libraries

Once you fetch HTML, you need a parser that operates efficiently on strings or streams without loading a full browser engine. The choice depends on whether you need DOM manipulation, streaming capabilities, or JavaScript execution.

Cheerio for jQuery-Style DOM Manipulation

Cheerio provides a fast, jQuery-like API for parsing static HTML. It builds a lightweight DOM representation entirely in memory, using parse5 by default with htmlparser2 available as an alternative parser, and avoids the overhead of a browser. Cheerio is a third-party npm package rather than part of the nodejs/node repository, and it is well established for scraping projects where you only need to extract data from server-rendered markup.

import * as cheerio from 'cheerio'; // Cheerio 1.x has no default export

// Load HTML into Cheerio
const html = '<ul id="fruits"><li class="apple">Apple</li><li class="orange">Orange</li></ul>';
const $ = cheerio.load(html);

// Use CSS selectors to extract data
const fruits = [];
$('.apple, .orange').each((i, elem) => {
  fruits.push($(elem).text());
});

console.log(fruits); // ['Apple', 'Orange']

Streaming HTML Parsing with htmlparser2

For very large documents, htmlparser2 offers a SAX-style streaming interface. Rather than buffering the entire HTML string, its Parser emits events (onopentag, ontext, onclosetag) as chunks are written to it, so you can extract data without holding the full document on the heap.

import { request } from 'undici';
import { Parser } from 'htmlparser2';

async function streamScrape(url) {
  const { body } = await request(url, { method: 'GET' });

  const parser = new Parser({
    onopentag(name, attrs) {
      // Capture image sources as they appear in the stream
      if (name === 'img' && attrs.src) {
        console.log('Image source found:', attrs.src);
      }
    },
    ontext(text) {
      // Process text content without buffering the whole page
    },
  });

  // Decode bytes incrementally (assumes UTF-8) so that multi-byte
  // characters split across chunks are not corrupted
  const decoder = new TextDecoder('utf-8');
  for await (const chunk of body) {
    parser.write(decoder.decode(chunk, { stream: true }));
  }
  parser.write(decoder.decode()); // flush any bytes still buffered
  parser.end();
}
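Streaming also makes it possible to cap how much of a response a scraper is willing to read. The sketch below assumes the body is an async iterable of byte chunks (Buffer or Uint8Array); `limitBytes` is a hypothetical helper, not an API of any library mentioned here.

```javascript
// Hypothetical helper: pass byte chunks through until more than
// `maxBytes` have been seen, then throw so the caller stops reading.
async function* limitBytes(byteChunks, maxBytes) {
  let total = 0;
  for await (const chunk of byteChunks) {
    total += chunk.byteLength;
    if (total > maxBytes) {
      // Throwing exits the for-await loop, which calls return() on the
      // source iterator; for a Node.js Readable this destroys the stream.
      throw new RangeError(`Response exceeded ${maxBytes} bytes`);
    }
    yield chunk;
  }
}
```

A streaming loop could then iterate over `limitBytes(body, 5 * 1024 * 1024)` instead of `body` and treat the `RangeError` as a signal to skip the page. The 5 MiB figure is only an illustrative value.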

Full Browser Emulation with jsdom

When scraping single-page applications or sites that rely heavily on client-side JavaScript to render content, jsdom emulates a browser environment in pure JavaScript. It implements the WHATWG DOM and HTML standards, allowing you to execute scripts and interact with a window object, although it performs no layout or visual rendering. This comes at the cost of significantly higher CPU and memory usage compared to Cheerio or streaming parsers.

import { request } from 'undici';
import { JSDOM } from 'jsdom';

async function scrapeWithJsdom(url) {
  const { body } = await request(url, { method: 'GET' });
  
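  // Read the full HTML; body.text() decodes the whole stream as UTF-8
  const html = await body.text();

  // Create a virtual DOM with script execution enabled.
  // runScripts: 'dangerously' runs the page's inline scripts inside Node.js,
  // so use it only with content you trust. External <script src> files
  // load only if resources: 'usable' is also set.
  const dom = new JSDOM(html, { runScripts: 'dangerously' });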
  const document = dom.window.document;

  // Extract data using standard DOM APIs
  const items = [...document.querySelectorAll('.product')].map(el => ({
    title: el.querySelector('.title')?.textContent.trim(),
    price: el.querySelector('.price')?.textContent.trim(),
  }));

  return items;
}

Key Node.js Source Files Supporting Web Scraping Workflows

The following source files and packages underpin the workflows above. Cheerio is a third-party package and is not vendored in the Node.js repository:

| File or package | Purpose | Location |
| --- | --- | --- |
| lib/internal/url.js | WHATWG URL parsing and resolution logic used by the URL class and the built-in fetch | nodejs/node |
| lib/internal/http2/ | HTTP/2 implementation exposed as the http2 module, which Undici's HTTP/2 support builds on | nodejs/node |
| lib/perf_hooks.js | Public perf_hooks module, whose timers can measure request latency in real-time scrapers | nodejs/node |
| deps/undici | Vendored copy of Undici, the high-performance HTTP client behind the built-in fetch | nodejs/node |
| cheerio | Fast HTML parsing library with a jQuery-style API, installed from npm | cheeriojs/cheerio |

These files illustrate how Node’s core modules and bundled libraries provide the building blocks for high-throughput, low-latency web scraping pipelines.
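Request latency can be measured from application code with the global `performance` object, which the public `node:perf_hooks` module also exports. The sketch below assumes a hypothetical `timed` helper that wraps any async operation and reports how long it took.

```javascript
// Hypothetical helper: time an async operation with the monotonic,
// sub-millisecond clock provided by the global `performance` object.
async function timed(operation) {
  const start = performance.now();
  const value = await operation();
  return { value, durationMs: performance.now() - start };
}
```

For instance, `const { value: headlines, durationMs } = await timed(() => scrapeHeadlines(url));` returns the scraped data together with the elapsed time, which can be logged or fed into a metrics pipeline. If the operation rejects, the rejection propagates and no duration is reported in this sketch.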

Summary

  • Use Undici for HTTP fetching in Node.js web scraping applications, leveraging its WebAssembly-compiled llhttp parser and per-origin connection pooling; a copy is vendored in deps/undici to power the built-in fetch.
  • Use Cheerio (installed from npm) for fast, memory-efficient HTML parsing when you need CSS selector support and a jQuery-style API without browser overhead.
  • Use htmlparser2 for streaming large documents via SAX-style event parsing to avoid buffering entire pages into memory.
  • Use jsdom only when client-side JavaScript execution is required, accepting the higher CPU and memory costs.
  • Monitor performance with the perf_hooks module (for example performance.now()) to optimize request latency in real-time scraping pipelines.

Frequently Asked Questions

What is the fastest HTTP client for web scraping in Node.js?

Undici is generally the highest-throughput HTTP client for Node.js web scraping. It parses HTTP/1.1 responses with llhttp compiled to WebAssembly and maintains a connection pool per origin. It powers Node’s built-in fetch through the copy vendored in deps/undici, and its HTTP/2 support, enabled with the allowH2 option, builds on Node’s http2 module.

Should I use Cheerio or jsdom for HTML parsing?

Use Cheerio for static HTML parsing when you need CSS selector support without JavaScript execution. It provides a lightweight jQuery-style API and operates entirely in memory without browser overhead. Use jsdom only when the target page requires client-side script execution to render content, as it implements the WHATWG DOM and HTML standards at the cost of significantly higher CPU and memory usage.

How do I handle large HTML documents without running out of memory?

For large documents in real-time Node.js scraping pipelines, use htmlparser2 with its SAX-style streaming interface to process HTML chunks as they arrive via onopentag and ontext events. This approach avoids buffering the entire document, so memory use depends on what your handlers retain rather than on the size of the page.

Does Node.js have built-in support for web scraping?

Node.js provides the networking building blocks: a global fetch backed by the Undici copy vendored in deps/undici, the WHATWG URL implementation in lib/internal/url.js, and the http2 and perf_hooks modules. It does not include an HTML parser, so production scrapers add a parsing library such as Cheerio from npm.
