# How Playwright-Based Liveness Verification Classifies Job Postings as Expired or Active

> Learn how santifer/career-ops uses Playwright to verify job posting liveness. Discover its rule-based pipeline that classifies postings as active, expired, or uncertain based on status, redirects, and content.

- Repository: [Santiago Fernández de Valderrama/career-ops](https://github.com/santifer/career-ops)
- Tags: how-to-guide
- Published: 2026-06-07

---

**Career-Ops uses a deterministic, rule-based pipeline that first checks HTTP status and URL redirects, then analyzes page content and visible apply controls to classify job postings as expired, active, or uncertain.**

The `santifer/career-ops` repository automates job search workflows by filtering dead listings before they reach users. Its **Playwright-based liveness verification** system combines browser automation with content heuristics to determine whether a posting is still accepting applications or has been removed.

## The Two-Stage Verification Pipeline

The verification process separates network-level data collection from content-level decision making. This architecture ensures that expensive browser operations happen once, while classification rules run cheaply against extracted data.

### Stage 1: Browser Navigation and Data Extraction

The `checkUrlLiveness` function in **`liveness-browser.mjs`** orchestrates the Playwright session. It launches a headless Chromium instance, navigates to the target URL, and waits for the DOM to fully load. The function captures three critical data points:

- **`finalUrl`**: The URL after all redirects resolve.
- **`bodyText`**: The raw inner text of the page body for keyword analysis.
- **`applyControls`**: An array of visible interactive elements that allow users to submit applications.

These values, together with the HTTP response status that the classifier also reads, are packaged into a plain object and passed to the classifier.

### Stage 2: Rule-Based Classification

The `classifyLiveness` function in **`liveness-core.mjs`** receives the extracted data and applies a strictly ordered series of checks. Each check returns immediately upon matching, producing a result object with three properties: `result` (`expired`, `active`, or `uncertain`), `code` (a machine-readable reason), and `reason` (a human-readable explanation).

## Classification Rules: From Expired to Active

The classifier evaluates conditions in priority order, allowing hard failures to short-circuit before UI analysis begins.

**HTTP-Level Failures**

- **`http_gone`**: If the response status is `404` or `410`, the posting is immediately marked **expired**.

**Redirect Analysis**

- **`expired_url`**: If `finalUrl` matches any pattern in `EXPIRED_URL_PATTERNS` (such as `?error=true` or common job-board error paths), the posting is **expired**.
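
As a minimal sketch of the redirect check, assuming `EXPIRED_URL_PATTERNS` is an array of regular expressions: the two patterns and the `matchExpiredUrl` helper below are illustrative placeholders, not the repository's actual list or API.

```javascript
// Illustrative patterns only; the real EXPIRED_URL_PATTERNS list may differ.
const EXPIRED_URL_PATTERNS = [
  /[?&]error=true\b/i,      // error flag appended after a redirect
  /\/(job-)?not-found\b/i,  // hypothetical job-board error path
];

// Hypothetical helper: returns the first pattern that matches the
// post-redirect URL, or null when none match.
function matchExpiredUrl(finalUrl) {
  return EXPIRED_URL_PATTERNS.find((pattern) => pattern.test(finalUrl)) ?? null;
}

console.log(matchExpiredUrl('https://jobs.example.com/search?error=true') !== null); // true
console.log(matchExpiredUrl('https://jobs.example.com/job/123') !== null);           // false
```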

**Content Heuristics**

- **`expired_body`**: If `bodyText` matches `HARD_EXPIRED_PATTERNS` containing phrases like "job no longer available" or "position has been filled", the result is **expired**.
- **`listing_page`**: If the content matches `LISTING_PAGE_PATTERNS` (e.g., "12 jobs found"), the posting is **expired** because the specific job redirected to a generic search results page.
- **`insufficient_content`**: If `bodyText` contains fewer than `MIN_CONTENT_CHARS` (default 300), the posting is **expired** as the page likely contains only navigation footers or error shells.
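
The three content rules can be sketched as follows. This is a simplified illustration, not the repository's code: the regular expressions are assumptions built from the example phrases above, `classifyContent` is a hypothetical helper name, and trimming whitespace before measuring length is also an assumption. Only the 300-character default comes from the text.

```javascript
// Assumed patterns for illustration; the real lists are likely longer.
const HARD_EXPIRED_PATTERNS = [/job (is )?no longer available/i, /position has been filled/i];
const LISTING_PAGE_PATTERNS = [/\b\d+\s+jobs?\s+found\b/i];
const MIN_CONTENT_CHARS = 300;

// Applies the content rules in the order listed above.
// Returns { result, code } for the first rule that matches, or null if none do.
function classifyContent(bodyText) {
  if (HARD_EXPIRED_PATTERNS.some((p) => p.test(bodyText))) {
    return { result: 'expired', code: 'expired_body' };
  }
  if (LISTING_PAGE_PATTERNS.some((p) => p.test(bodyText))) {
    return { result: 'expired', code: 'listing_page' };
  }
  if (bodyText.trim().length < MIN_CONTENT_CHARS) {
    return { result: 'expired', code: 'insufficient_content' };
  }
  return null;
}

console.log(classifyContent('Sorry, this position has been filled.').code); // expired_body
console.log(classifyContent('12 jobs found matching your search').code);    // listing_page
console.log(classifyContent('Careers | Privacy | Terms').code);             // insufficient_content
```

Returning `null` when nothing matches lets a caller fall through to the later rules instead of deciding the final outcome here.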

**Active Indicators**

- **`apply_control_visible`**: If `hasApplyControl(applyControls)` detects visible apply buttons or links, the posting is marked **active** and all further checks are skipped.
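
A rough sketch of what an apply-control check might look like. The shape of `applyControls` is not specified here, so this example assumes each entry is the visible label of a button or link already filtered for visibility by the browser stage. The name `hasApplyControlSketch` and the label regex are stand-ins, not the repository's `hasApplyControl`.

```javascript
// Assumption: applyControls is an array of visible button/link labels.
const APPLY_LABEL = /\bapply\b/i;

// Stand-in for hasApplyControl: true if any label contains the word "apply".
function hasApplyControlSketch(applyControls = []) {
  return applyControls.some((label) => APPLY_LABEL.test(String(label).trim()));
}

console.log(hasApplyControlSketch(['Share', 'Apply now'])); // true
console.log(hasApplyControlSketch(['Sign in', 'Share']));   // false
```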

**Fallback**

- **`no_apply_control`**: If none of the above conditions match, the posting is **uncertain** and flagged for manual review or retry.

## Integration with the Scan Workflow

When users run `node scan.mjs --verify`, the `verifyOffers` helper iterates through discovered URLs and invokes the two-stage pipeline. Results are bucketed into three distinct actions:

- **Active**: Postings classified as `active` proceed to the `verified` pipeline and are shown to users.
- **Expired**: Postings with `result: 'expired'` are written to `scan-history.tsv` with status `skipped_expired` and removed from the queue.
- **Uncertain**: Postings with `result: 'uncertain'` and no apply control are treated as `dropped` and logged as `skipped_no_apply_control`. Those failing due to navigation errors remain **uncertain** and are retried on subsequent scans.

As a result, only URLs where the classifier found a visible apply control reach the user-facing output.
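
The bucketing described above can be sketched as a small mapping function. The status strings `skipped_expired` and `skipped_no_apply_control` come from the text; the function name `bucketOffer`, the `action` field, and the `retry` label are illustrative assumptions rather than the actual `verifyOffers` internals.

```javascript
// Maps a classifier result to a scan outcome (hypothetical helper).
function bucketOffer({ result, code }) {
  if (result === 'active') {
    return { action: 'verified', historyStatus: null };
  }
  if (result === 'expired') {
    return { action: 'dropped', historyStatus: 'skipped_expired' };
  }
  if (code === 'no_apply_control') {
    // Uncertain because the page loaded but exposed no apply control.
    return { action: 'dropped', historyStatus: 'skipped_no_apply_control' };
  }
  // Any other uncertain outcome (for example a navigation error) is retried.
  return { action: 'retry', historyStatus: null };
}

console.log(bucketOffer({ result: 'expired', code: 'http_gone' }).historyStatus); // skipped_expired
console.log(bucketOffer({ result: 'uncertain', code: 'no_apply_control' }).action); // dropped
```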

## Safety Guards and Input Validation

Before Playwright consumes resources, `rejectPrivateOrInvalid` in **`liveness-browser.mjs`** blocks potentially dangerous or invalid inputs. It returns a guard result for:

- Non-HTTP protocols (`unsupported_protocol`).
- Localhost or private network addresses (`blocked_host`).
- Malformed URLs (`invalid_url`).

Guarded URLs never reach the classification stage, preventing the browser from attempting to load internal resources or unsupported schemes.
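
A guard of this kind might look like the sketch below. It is an assumption-laden illustration, not the actual `rejectPrivateOrInvalid`: the function name `guardUrl`, the return shape, and the set of blocked ranges (loopback, RFC 1918, and link-local IPv4, plus `localhost` and `::1`) are all chosen for the example. The three codes are the ones listed above.

```javascript
// True for loopback, RFC 1918, and link-local IPv4 literals (illustrative subset).
function isPrivateIpv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a === 127 || a === 10 || (a === 192 && b === 168) ||
    (a === 172 && b >= 16 && b <= 31) || (a === 169 && b === 254);
}

// Hypothetical guard: returns { code } when the URL must not be loaded, else null.
function guardUrl(input) {
  let parsed;
  try {
    parsed = new URL(input); // throws on malformed input
  } catch {
    return { code: 'invalid_url' };
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return { code: 'unsupported_protocol' };
  }
  const host = parsed.hostname.toLowerCase();
  if (host === 'localhost' || host === '[::1]' || isPrivateIpv4(host)) {
    return { code: 'blocked_host' };
  }
  return null;
}

console.log(guardUrl('ftp://example.com/job'));        // { code: 'unsupported_protocol' }
console.log(guardUrl('http://localhost:3000/admin'));  // { code: 'blocked_host' }
console.log(guardUrl('https://example.com/job/123'));  // null
```

Parsing with the built-in `URL` class first means the protocol and host checks operate on a normalized value rather than on the raw string.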

## Practical Code Examples

You can manually verify a single URL using the `checkUrlLiveness` driver:

```javascript
import { chromium } from 'playwright';
import { checkUrlLiveness } from './liveness-browser.mjs';

async function demo(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Returns the classifier's result object for this URL.
    const { result, code, reason } = await checkUrlLiveness(page, url);
    console.log(`${url} → ${result} (${code}): ${reason}`);
  } finally {
    // Close the browser even if navigation or classification throws.
    await browser.close();
  }
}

demo('https://example.com/job/123').catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```

The core classification logic inside `classifyLiveness` follows this simplified structure:

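```javascript
// Core classification snippet (simplified: `reason` strings and the
// listing_page / insufficient_content checks are omitted for brevity)
function classifyLiveness({ status, finalUrl, bodyText, applyControls }) {
  // Hard HTTP failures: both 404 and 410 mean the posting is gone.
  if (status === 404 || status === 410) return { result: 'expired', code: 'http_gone' };
  // The final URL after redirects matches a known error pattern.
  if (EXPIRED_URL_PATTERNS.some((p) => p.test(finalUrl))) return { result: 'expired', code: 'expired_url' };
  // The page text explicitly says the job is closed.
  if (HARD_EXPIRED_PATTERNS.some((p) => p.test(bodyText))) return { result: 'expired', code: 'expired_body' };
  // A visible apply control marks the posting as live.
  if (hasApplyControl(applyControls)) return { result: 'active', code: 'apply_control_visible' };
  // Fallback when no rule matched.
  return { result: 'uncertain', code: 'no_apply_control' };
}
```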

## Summary

- **Playwright-based liveness verification** in `santifer/career-ops` extracts page data via `checkUrlLiveness` in `liveness-browser.mjs` before applying deterministic rules in `classifyLiveness` from `liveness-core.mjs`.
- The system checks HTTP status, redirect patterns, expired-content keywords, and listing-page indicators before confirming activity through visible apply controls.
- Results are bucketed into `active`, `expired`, or `uncertain` outcomes, with only active postings proceeding to the user-facing pipeline.
- Safety guards in `rejectPrivateOrInvalid` prevent the browser from accessing private networks or invalid protocols.

## Frequently Asked Questions

### What does "uncertain" mean in Playwright-based liveness verification?

An **uncertain** classification indicates that the crawler loaded the page successfully but found none of the definitive markers for either expiration or activity. This typically occurs when the page lacks an apply button but also lacks explicit "job filled" language or HTTP errors. These URLs are logged as `skipped_no_apply_control` and excluded from results unless they show an apply control on retry.

### How does Career-Ops prevent false positives when a job is actually active?

The classifier treats a visible apply control as authoritative evidence that a posting is live, but only after the higher-priority expiration rules have been ruled out. If none of those rules match and `hasApplyControl` finds a visible control in the `applyControls` array, `classifyLiveness` immediately returns `active` with code `apply_control_visible` and skips the remaining checks. A page with a working apply button is therefore not downgraded to `uncertain` just because its text is ambiguous.

### Can I run liveness checks without triggering a full scan?

Yes. The repository includes **`check-liveness.mjs`**, a thin CLI wrapper around `checkUrlLiveness`. You can invoke this script directly against specific URLs to debug classification results or verify individual postings without executing the full `scan.mjs` pipeline with the `--verify` flag.

### Why does the system use 300 characters as the minimum content threshold?

The `MIN_CONTENT_CHARS` constant (default 300) filters out pages that render only navigation headers, footers, or generic error shells. When a job posting is removed, some sites return a valid HTTP 200 status with nearly empty HTML. By requiring at least 300 characters of body text, the `insufficient_content` rule catches these hollow responses and marks them expired before they reach the user.