# Building Real-Time Web Scraping Node.js Applications: Efficient Libraries for Fetching and Parsing HTML

> Build real-time web scraping Node.js apps with Undici for fast fetching and Cheerio for efficient HTML parsing. Master high-performance scraping techniques.

- Repository: [Node.js/node](https://github.com/nodejs/node)
- Tags: tutorial
- Published: 2026-02-16

---

**For real-time web scraping Node.js applications, use Undici for high-performance HTTP fetching with per-origin connection pooling and an llhttp parser compiled to WebAssembly, and Cheerio for fast, lightweight HTML parsing with a jQuery-style API.**

Building real-time web scraping Node.js applications requires libraries that minimize latency and memory overhead while handling concurrent connections. The `nodejs/node` repository provides core infrastructure through modules like [`lib/internal/url.js`](https://github.com/nodejs/node/blob/main/lib/internal/url.js) and [`lib/internal/http2.js`](https://github.com/nodejs/node/blob/main/lib/internal/http2.js), while bundling high-performance dependencies such as Undici in `deps/undici`. Application code reaches these internals through public APIs such as `URL`, `node:http2`, and the global `fetch`, because files under `lib/internal` cannot be imported directly. This guide examines the most efficient libraries for fetching and parsing HTML content with reference to that source layout.

## High-Performance HTTP Fetching with Undici

When scraping at scale, the HTTP client must handle many concurrent connections, TLS handshakes, and streaming data with minimal latency. **Undici**, the modern HTTP client that backs Node.js’s global `fetch` implementation, is designed for higher throughput than the legacy `http` module. The source in `deps/undici` parses HTTP/1.1 responses with llhttp compiled to WebAssembly and maintains connection pools per origin. HTTP/2 is available through the `allowH2` option and builds on Node's HTTP/2 implementation ([`lib/internal/http2.js`](https://github.com/nodejs/node/blob/main/lib/internal/http2.js)).

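For real-time scraping, use Undici’s `request` method, which returns the response body as a stream. The example below collects that stream into a single string because Cheerio needs the complete document, which is fine for small-to-medium pages. Note that `request` comes from the `undici` npm package; Node exposes its bundled copy only through globals such as `fetch`.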

```javascript
import { request } from 'undici';
import * as cheerio from 'cheerio'; // Cheerio 1.x has no default export

async function scrapeHeadlines(url) {
  // Undici automatically reuses connections via internal pooling
  const { statusCode, body } = await request(url, {
    method: 'GET',
    headers: { 'accept': 'text/html' },
  });

  if (statusCode !== 200) {
    throw new Error(`Unexpected status ${statusCode}`);
  }

  // Collect HTML for parsing (suitable for small-to-medium pages)
  const chunks = [];
  for await (const chunk of body) {
    chunks.push(chunk);
  }
  const html = Buffer.concat(chunks).toString('utf8');

  // Parse with Cheerio for jQuery-style selectors
  const $ = cheerio.load(html);
  const headlines = [];
  $('h1, h2, h3').each((_, el) => {
    headlines.push($(el).text().trim());
  });

  return headlines;
}

```

*Key source references:*  
- Undici’s implementation resides in `deps/undici` and powers Node's global `fetch`; URL parsing relies on the WHATWG `URL` implementation in [`lib/internal/url.js`](https://github.com/nodejs/node/blob/main/lib/internal/url.js).  
- HTTP/2 support, when enabled with `allowH2`, builds on Node's HTTP/2 implementation ([`lib/internal/http2.js`](https://github.com/nodejs/node/blob/main/lib/internal/http2.js)) for multiplexed streams.
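
Handling many concurrent connections usually also means bounding how many requests are in flight at once, so a scraper does not overwhelm the target or itself. The following is a minimal sketch of a small worker pool, not part of Undici: `mapWithConcurrency` is a hypothetical helper name, and the `worker` callback stands in for whatever fetch-and-parse function you use.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner() {
    // Each runner keeps claiming items until none remain.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results; // same order as `items`, regardless of completion order
}
```

In a scraper, `worker` could be something like `(url) => scrapeHeadlines(url)`. If any call rejects, the returned promise rejects with that error, so wrap the worker in `try`/`catch` when partial results are acceptable.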

## Efficient HTML Parsing Libraries

Once you fetch HTML, you need a parser that operates efficiently on strings or streams without loading a full browser engine. The choice depends on whether you need DOM manipulation, streaming capabilities, or JavaScript execution.

### Cheerio for jQuery-Style DOM Manipulation

**Cheerio** provides a fast, jQuery-like API for parsing static HTML. It builds a lightweight DOM representation entirely in memory, using `parse5` by default with the more forgiving `htmlparser2` available as an option, and avoids the overhead of a browser. Cheerio is a third-party npm package rather than part of the Node.js repository, and it is widely used for scraping projects where you only need to extract data from server-rendered markup.

```javascript
import * as cheerio from 'cheerio'; // Cheerio 1.x has no default export

// Load HTML into Cheerio
const html = '<ul id="fruits"><li class="apple">Apple</li><li class="orange">Orange</li></ul>';
const $ = cheerio.load(html);

// Use CSS selectors to extract data
const fruits = [];
$('.apple, .orange').each((i, elem) => {
  fruits.push($(elem).text());
});

console.log(fruits); // ['Apple', 'Orange']

```

### Streaming HTML Parsing with htmlparser2

For very large documents that would strain available memory, **htmlparser2** offers a SAX-style streaming interface. Rather than buffering the entire HTML string, its `Parser` emits events (`onopentag`, `ontext`, `onclosetag`) as the stream flows through, allowing you to extract data without loading the full document into the heap. (The similarly named `node-html-parser` package builds a full DOM tree from a complete string and does not expose this event interface.)

```javascript
import { request } from 'undici';
import { Parser } from 'htmlparser2';

async function streamScrape(url) {
  const { body } = await request(url, { method: 'GET' });

  const parser = new Parser({
    onopentag(name, attrs) {
      // Capture image sources as they appear in the stream
      if (name === 'img' && attrs.src) {
        console.log('Image source found:', attrs.src);
      }
    },
    ontext(text) {
      // Process text content without buffering the whole page
    }
  });

  // Decode incrementally so multi-byte characters split across chunks stay intact
  const decoder = new TextDecoder('utf-8');
  for await (const chunk of body) {
    parser.write(decoder.decode(chunk, { stream: true }));
  }
  parser.write(decoder.decode()); // flush any bytes still buffered
  parser.end();
}

```
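
Chunks arriving from the network are raw bytes, and a multi-byte UTF-8 character can straddle two chunks. The sketch below illustrates incremental decoding with the standard `TextDecoder`. `decodeChunks` is a hypothetical helper, and it is written as a synchronous generator only to keep the demonstration short; the same `decode` calls apply inside a `for await` loop over a response body.

```javascript
// Hypothetical helper: decode byte chunks without corrupting characters
// that are split across a chunk boundary.
function* decodeChunks(chunks) {
  const decoder = new TextDecoder('utf-8');
  for (const chunk of chunks) {
    // `stream: true` keeps an incomplete trailing byte sequence buffered
    const text = decoder.decode(chunk, { stream: true });
    if (text) yield text;
  }
  const rest = decoder.decode(); // flush whatever is left
  if (rest) yield rest;
}

// "é" is two bytes in UTF-8 (0xC3 0xA9); split the input between them.
const bytes = new TextEncoder().encode('<p>café</p>');
const splitAt = bytes.indexOf(0xa9); // position of the second byte of "é"
const pieces = [bytes.subarray(0, splitAt), bytes.subarray(splitAt)];

const decoded = [...decodeChunks(pieces)].join('');
console.log(decoded); // <p>café</p>
```

Calling `toString()` on each chunk separately would instead produce replacement characters at the split, which is why string concatenation of raw chunks is best avoided.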

### Full Browser Emulation with jsdom

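When scraping single-page applications or sites that rely heavily on client-side JavaScript to render content, **jsdom** provides a browser-like environment implemented in JavaScript. It implements large parts of the WHATWG DOM and HTML standards, allowing you to execute scripts and interact with a `window` object, although it performs no layout or rendering. This comes at the cost of significantly higher CPU and memory usage compared to Cheerio or streaming parsers. Because `runScripts: 'dangerously'` executes the page's scripts inside your own process, enable it only for content you trust.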

```javascript
import { request } from 'undici';
import { JSDOM } from 'jsdom';

async function scrapeWithJsdom(url) {
  const { body } = await request(url, { method: 'GET' });
  
  // Collect the full HTML; body.text() decodes the whole stream as UTF-8
  const html = await body.text();

  // Create a virtual DOM with inline script execution enabled.
  // External <script src> files load only if `resources: 'usable'` is also set.
  const dom = new JSDOM(html, { runScripts: 'dangerously' });
  const document = dom.window.document;

  // Extract data using standard DOM APIs
  const items = [...document.querySelectorAll('.product')].map(el => ({
    title: el.querySelector('.title')?.textContent.trim(),
    price: el.querySelector('.price')?.textContent.trim(),
  }));

  return items;
}

```

## Key Node.js Source Files Supporting Web Scraping Workflows

The Node.js repository contains modules that underpin the fetching side of these workflows. They are reached through public APIs rather than imported directly, and Cheerio is maintained outside the repository:

| File | Purpose | Location |
|------|---------|----------|
| [`lib/internal/url.js`](https://github.com/nodejs/node/blob/main/lib/internal/url.js) | Core URL parsing and resolution logic (the WHATWG `URL` class) used by the built-in fetch implementation (Undici-derived) | `nodejs/node` |
| [`lib/internal/http2.js`](https://github.com/nodejs/node/blob/main/lib/internal/http2.js) | HTTP/2 implementation that Undici can use for multiplexed streams when `allowH2` is enabled | `nodejs/node` |
| [`lib/internal/performance.js`](https://github.com/nodejs/node/blob/main/lib/internal/performance.js) | Performance timers for measuring request latency in real-time scrapers, exposed publicly through `node:perf_hooks` | `nodejs/node` |
| [`deps/undici/README.md`](https://github.com/nodejs/node/blob/main/deps/undici/README.md) | Documentation for Undici, the high-performance HTTP client bundled with Node | `nodejs/node` |
| `README.md` in the `cheeriojs/cheerio` repository | Cheerio usage guide for fast HTML parsing (third-party npm package, not bundled with Node.js) | `cheeriojs/cheerio` |

These files illustrate how Node’s core modules and bundled libraries provide the building blocks for high-throughput, low-latency web scraping pipelines.
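
The URL handling listed in the table is available to application code as the global `URL` class. As a minimal sketch of how a scraper might use it, the hypothetical `resolveLinks` helper below turns scraped `href` values into absolute, de-duplicated HTTP(S) URLs. The example page and links are made up for illustration.

```javascript
// Hypothetical helper: resolve scraped href values against the page URL.
function resolveLinks(pageUrl, hrefs) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative references against the page
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    url.hash = ''; // fragments do not change which document is fetched
    seen.add(url.href);
  }
  return [...seen];
}

const links = resolveLinks('https://example.com/news/index.html', [
  '/about',
  'story.html#comments',
  'story.html',
  'mailto:editor@example.com',
  'https://example.org/feed',
]);
console.log(links);
// [ 'https://example.com/about',
//   'https://example.com/news/story.html',
//   'https://example.org/feed' ]
```

Normalizing links this way before queueing them helps avoid fetching the same page twice under different spellings.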

## Summary

- **Use Undici** for HTTP fetching in web scraping Node.js applications, leveraging its WebAssembly-compiled llhttp parser and connection pooling as implemented in `deps/undici`, with URL handling provided by [`lib/internal/url.js`](https://github.com/nodejs/node/blob/main/lib/internal/url.js).
- **Use Cheerio** for fast, lightweight HTML parsing when you need CSS selector support without browser overhead. Install it from npm, since it is not part of the Node.js repository.
- **Use htmlparser2** for streaming large documents via SAX-style event parsing to avoid buffering entire pages into memory.
- **Use jsdom** only when client-side JavaScript execution is required, accepting the higher CPU and memory costs.
- **Monitor performance** with the public `node:perf_hooks` API, backed by [`lib/internal/performance.js`](https://github.com/nodejs/node/blob/main/lib/internal/performance.js), to optimize request latency in real-time scraping pipelines.
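
The last point can be sketched with the global `performance` clock, which is the same high-resolution timer exposed by `node:perf_hooks`. The `timed` wrapper and the `fakeScrape` stand-in are hypothetical names used only for illustration; in practice you would pass your real fetch-and-parse function.

```javascript
// Hypothetical helper: measure how long an async operation takes.
async function timed(operation) {
  const start = performance.now();
  const result = await operation();
  return { result, elapsedMs: performance.now() - start };
}

// Stand-in for a real fetch-and-parse step (resolves after roughly 20 ms).
const fakeScrape = () =>
  new Promise((resolve) => setTimeout(() => resolve(['headline']), 20));

timed(fakeScrape).then(({ result, elapsedMs }) => {
  console.log(result, `${elapsedMs.toFixed(1)} ms`);
});
```

Recording these durations per origin makes it easier to spot slow hosts and to tune concurrency limits.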

## Frequently Asked Questions

### What is the fastest HTTP client for web scraping in Node.js?

**Undici** is among the fastest HTTP clients for web scraping Node.js applications. It parses HTTP/1.1 with llhttp compiled to WebAssembly and maintains connection pools per origin, as documented in `deps/undici`. It powers Node's global `fetch` implementation, relies on [`lib/internal/url.js`](https://github.com/nodejs/node/blob/main/lib/internal/url.js) for URL resolution, and can use Node's HTTP/2 implementation ([`lib/internal/http2.js`](https://github.com/nodejs/node/blob/main/lib/internal/http2.js)) for multiplexing when `allowH2` is enabled.

### Should I use Cheerio or jsdom for HTML parsing?

Use **Cheerio** for static HTML parsing when you need CSS selector support without JavaScript execution. It is a lightweight npm package with a jQuery-style API that operates entirely in memory without browser overhead. Use **jsdom** only when the target page requires client-side script execution to render content, as it implements large parts of the WHATWG DOM and HTML standards at the cost of significantly higher CPU and memory usage.

### How do I handle large HTML documents without running out of memory?

For large documents in real-time web scraping Node.js pipelines, use **htmlparser2** with its SAX-style streaming interface to process HTML chunks as they arrive via `onopentag` and `ontext` events. This approach avoids buffering the entire document into memory, so memory usage stays roughly bounded by the chunk size and whatever data you choose to keep, even for very large pages.
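
When a page must still be buffered in full, for example to hand it to Cheerio, a complementary safeguard is a size cap. The following is a minimal sketch, assuming the chunks expose `byteLength` as Node `Buffer`s do; `readCapped` and the limit you pass are hypothetical, not part of any library mentioned here.

```javascript
// Hypothetical helper: collect an async iterable of byte chunks into a string,
// giving up as soon as the total exceeds `maxBytes`.
async function readCapped(chunks, maxBytes) {
  const parts = [];
  let total = 0;
  for await (const chunk of chunks) {
    total += chunk.byteLength;
    if (total > maxBytes) {
      // Throwing here also ends iteration, which closes the underlying iterator.
      throw new RangeError(`Body exceeded ${maxBytes} bytes`);
    }
    parts.push(chunk);
  }
  return Buffer.concat(parts, total).toString('utf8');
}
```

A call such as `await readCapped(body, 5 * 1024 * 1024)` would then reject oversized responses early instead of letting them fill the heap.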

### Does Node.js have built-in support for web scraping?

Node.js provides core networking infrastructure through modules like [`lib/internal/url.js`](https://github.com/nodejs/node/blob/main/lib/internal/url.js), [`lib/internal/http2.js`](https://github.com/nodejs/node/blob/main/lib/internal/http2.js), and [`lib/internal/performance.js`](https://github.com/nodejs/node/blob/main/lib/internal/performance.js), but it ships no HTML parser, so a dedicated parsing library is required. The runtime bundles **Undici** in `deps/undici` and exposes it through the global `fetch` for high-performance HTTP fetching, while **Cheerio** is a separate npm package that provides the parsing capabilities needed for production scrapers.