How Playwright Liveness Verification Detects Expired Job Postings in Career-Ops

Career-Ops uses a two-stage Playwright pipeline that first navigates to the job URL to capture the final redirect, page text, and apply buttons, then applies deterministic classification rules to mark postings as expired, active, or uncertain.

The open-source Career-Ops project (santifer/career-ops) automates job posting verification using a Playwright-based liveness detection system. This article explains how the codebase distinguishes between active opportunities and expired listings through a combination of network-level heuristics and DOM content analysis.

The Two-Stage Verification Architecture

The liveness verification system splits detection between browser automation and pure classification logic. This separation allows for both robust page interaction and deterministic rule evaluation without side effects.

Stage 1: Playwright Navigation and Data Extraction

In liveness-browser.mjs, the checkUrlLiveness function drives a headless Chromium page to gather raw page telemetry:

  • The final URL after all redirects (finalUrl)
  • The raw inner text of the document body (bodyText)
  • A filtered list of visible apply controls (applyControls)
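
As a rough illustration of how these signals can be collected, the sketch below uses only standard Playwright Page methods (goto, url, evaluate). The helper name collectPageData, the selector list, the timeout, and the "apply" text filter are assumptions made for this example, not the code in liveness-browser.mjs.

```javascript
// Hypothetical helper: collects the raw signals listed above from a
// Playwright Page. Names, selectors, and timeout are assumptions.
async function collectPageData(page, url) {
  // goto() resolves to the main-resource Response, or null in some cases.
  const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 20000 });
  const status = response ? response.status() : 0;

  // page.url() reflects the address after any redirects.
  const finalUrl = page.url();

  // Runs in the browser context: rendered text of the document body.
  const bodyText = await page.evaluate(() => (document.body ? document.body.innerText : ''));

  // Runs in the browser context: text of rendered links and buttons
  // whose label mentions "apply".
  const applyControls = await page.evaluate(() =>
    Array.from(document.querySelectorAll('a, button, input[type="submit"]'))
      .filter((el) => el.getClientRects().length > 0)
      .map((el) => (el.innerText || el.value || '').trim())
      .filter((text) => /apply/i.test(text))
  );

  return { status, finalUrl, bodyText, applyControls };
}
```

Because the helper only touches the page object passed in, it can be exercised with a stub page in unit tests, which mirrors the article's separation between browser interaction and pure classification.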

Before navigation begins, the rejectPrivateOrInvalid guard blocks non-HTTP protocols, localhost, and private network addresses, returning an immediate invalid_url or blocked_host result for security.

Stage 2: Deterministic Classification

The classifyLiveness function in liveness-core.mjs receives the extracted data and applies a priority-ordered rule set. Each evaluation returns a structured object containing result (expired/active/uncertain), code, and reason.

The classification hierarchy evaluates conditions in this exact order:

  1. HTTP Status Validation – Checks if status === 404 or 410, returning expired with code http_gone.

  2. Redirect URL Analysis – Matches finalUrl against EXPIRED_URL_PATTERNS (e.g., ?error=true), returning expired with code expired_url.

  3. Body Text Heuristics – Searches bodyText for HARD_EXPIRED_PATTERNS like "job no longer available" or "position has been filled", returning expired with code expired_body.

  4. Apply Control Detection – If hasApplyControl(applyControls) identifies visible application buttons or links, returns active with code apply_control_visible.

  5. Listing Page Detection – Matches LISTING_PAGE_PATTERNS indicating search results pages (e.g., "12 jobs found"), returning expired with code listing_page.

  6. Content Sufficiency Check – Validates bodyText length against MIN_CONTENT_CHARS (default 300 characters), returning expired with code insufficient_content if the content is too short (typical of navigation-only pages).

  7. Fallback Classification – Returns uncertain with code no_apply_control if no preceding rules match.
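
Steps 5 through 7 only run once no apply control was found, and their order matters: a short search-results page should be reported as listing_page, not insufficient_content. A minimal sketch of that tail, assuming a single illustrative listing regex and a hypothetical helper name (the real pattern list lives in liveness-core.mjs and is not reproduced here):

```javascript
// Illustrative values only; not the repository's actual pattern list.
const LISTING_PAGE_PATTERNS = [/\b\d+\s+jobs?\s+found\b/i];
const MIN_CONTENT_CHARS = 300;

// Hypothetical helper covering steps 5-7; assumes steps 1-4 did not match.
function classifyWithoutApplyControl(bodyText) {
  const text = (bodyText || '').trim();

  // Step 5: search-results pages are treated as expired postings.
  if (LISTING_PAGE_PATTERNS.some((p) => p.test(text))) {
    return { result: 'expired', code: 'listing_page', reason: 'page looks like a search results listing' };
  }

  // Step 6: too little text to be a real job description.
  if (text.length < MIN_CONTENT_CHARS) {
    return { result: 'expired', code: 'insufficient_content', reason: `only ${text.length} characters of body text` };
  }

  // Step 7: nothing conclusive either way.
  return { result: 'uncertain', code: 'no_apply_control', reason: 'no visible apply control found' };
}
```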

Integration with the Scan Workflow

When running node scan.mjs --verify, the verifyOffers helper orchestrates the verification batch. It iterates over discovered URLs, invokes checkUrlLiveness, and routes results into three categories:

  • Active – Postings with visible apply controls proceed to the verified pipeline.
  • Expired – URLs flagged as expired are written to scan-history.tsv with status skipped_expired.
  • Uncertain – Results with code no_apply_control are dropped and logged as skipped_no_apply_control, while navigation errors stay uncertain so that later scans can retry them.
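
The routing above can be pictured as a small pure function. This is a sketch under assumptions: the helper name routeLivenessResult and the { keep, historyStatus } shape are hypothetical, and leaving navigation errors unrecorded is only one possible way to make them eligible for retry.

```javascript
// Hypothetical routing helper; the returned shape is an assumption and
// is not taken from scan.mjs.
function routeLivenessResult({ result, code }) {
  // Active postings continue to the verified pipeline.
  if (result === 'active') return { keep: true, historyStatus: null };

  // Expired postings are recorded so they are not scanned again.
  if (result === 'expired') return { keep: false, historyStatus: 'skipped_expired' };

  // Uncertain because no apply control was visible: dropped and logged.
  if (code === 'no_apply_control') return { keep: false, historyStatus: 'skipped_no_apply_control' };

  // Other uncertain outcomes (e.g. navigation errors): assumed here to be
  // left unrecorded so that a later scan can retry them.
  return { keep: false, historyStatus: null };
}
```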

Security and Reliability Guardrails

The rejectPrivateOrInvalid utility in liveness-browser.mjs prevents the browser from accessing internal infrastructure by validating URLs before page load. This ensures that private IP ranges and unsupported protocols never reach the classification stage, protecting both the scanning infrastructure and target systems.
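
A guard of this kind can be approximated with the standard URL parser. The sketch below is an assumption-laden illustration, not the repository's rejectPrivateOrInvalid: the function name guardUrl, the exact address ranges, and the { code, reason } return shape are all hypothetical.

```javascript
// Hypothetical guard: returns null when the URL may be visited, otherwise
// a { code, reason } object. The ranges checked are an assumption.
function guardUrl(rawUrl) {
  let parsed;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return { code: 'invalid_url', reason: 'not a parseable URL' };
  }

  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return { code: 'invalid_url', reason: `unsupported protocol ${parsed.protocol}` };
  }

  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = parsed.hostname.replace(/^\[|\]$/g, '').toLowerCase();
  if (host === 'localhost' || host.endsWith('.localhost') || host === '::1') {
    return { code: 'blocked_host', reason: 'loopback host' };
  }

  // Dotted-quad IPv4 literals: block loopback, private, and link-local ranges.
  const octets = host.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (octets) {
    const a = Number(octets[1]);
    const b = Number(octets[2]);
    const isPrivate =
      a === 0 || a === 10 || a === 127 ||
      (a === 172 && b >= 16 && b <= 31) ||
      (a === 192 && b === 168) ||
      (a === 169 && b === 254);
    if (isPrivate) return { code: 'blocked_host', reason: 'private or loopback IPv4 address' };
  }

  return null;
}
```

This sketch only inspects the literal hostname. It does not resolve DNS, so a public name that points at a private address would pass, and it does not cover private IPv6 ranges beyond ::1.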

Practical Usage Examples

To check a single URL programmatically:

import { chromium } from 'playwright';
import { checkUrlLiveness } from './liveness-browser.mjs';

async function validateJob(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    const { result, code, reason } = await checkUrlLiveness(page, url);
    console.log(`${url} → ${result} (${code}): ${reason}`);
  } finally {
    // Close the browser even if navigation or classification throws.
    await browser.close();
  }
}

// Surface failures instead of leaving an unhandled promise rejection.
validateJob('https://example.com/job/123').catch(console.error);

The core classification logic follows this deterministic flow, shown here in simplified form with the reason strings left out:

function classifyLiveness({status, finalUrl, bodyText, applyControls}) {
  // 1. Gone at the HTTP level.
  if (status === 404 || status === 410)
    return {result: 'expired', code: 'http_gone'};

  // 2. Redirected to a known "expired" URL shape.
  if (EXPIRED_URL_PATTERNS.some(p => p.test(finalUrl)))
    return {result: 'expired', code: 'expired_url'};

  // 3. Page text explicitly says the posting is closed.
  if (HARD_EXPIRED_PATTERNS.some(p => p.test(bodyText)))
    return {result: 'expired', code: 'expired_body'};

  // 4. Simplified stand-in for hasApplyControl(applyControls).
  if (applyControls.length > 0)
    return {result: 'active', code: 'apply_control_visible'};

  // Steps 5 and 6 (listing_page, insufficient_content) omitted for brevity.
  return {result: 'uncertain', code: 'no_apply_control'};
}

Summary

  • Playwright liveness verification in Career-Ops separates data collection from classification logic across liveness-browser.mjs and liveness-core.mjs.
  • The system prioritizes early-exit checks for HTTP errors and expired URL patterns before analyzing page content.
  • Visible apply controls serve as the primary signal for active postings, ensuring only actionable job links reach users.
  • Security guardrails in rejectPrivateOrInvalid prevent scanning of private networks and invalid protocols.
  • Integration via verifyOffers in scan.mjs provides automated batch processing with persistent history tracking.

Frequently Asked Questions

What triggers the "insufficient_content" classification?

When the extracted bodyText contains fewer than 300 characters (configurable via MIN_CONTENT_CHARS), the system classifies the posting as expired with code insufficient_content. This heuristic catches pages that render only navigation footers or skeleton loaders without actual job details.

How does the system handle redirects to expired job pages?

The checkUrlLiveness function captures the finalUrl after all redirects complete. If this final URL matches a pattern in EXPIRED_URL_PATTERNS (such as a query string containing error=true), the classifier returns expired with code expired_url before it inspects the page body.

Can the liveness checker distinguish between a filled position and a removed posting?

Not at the level of the result or code. The HARD_EXPIRED_PATTERNS array includes phrases such as "position has been filled" and "job no longer available", but a match on either one produces the same outcome: result expired with code expired_body. The reason field is the only part of the returned object that could record which phrase matched.

Why are some URLs classified as "uncertain" rather than expired?

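URLs receive uncertain status when they load successfully but lack visible apply controls and do not match any hard expiration, listing-page, or insufficient-content rule. These are ambiguous cases, such as JavaScript-heavy pages with delayed rendering, where marking the posting expired would risk discarding a live job. In the scan workflow, results with code no_apply_control are logged as skipped_no_apply_control, while navigation errors remain uncertain and are retried on subsequent scans.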
