How Playwright Liveness Verification Detects Expired Job Postings in Career-Ops
Career-Ops uses a two-stage Playwright pipeline that first navigates to the job URL to capture the final redirect, page text, and apply buttons, then applies deterministic classification rules to mark postings as expired, active, or uncertain.
The open-source Career-Ops project (santifer/career-ops) automates job posting verification using a Playwright-based liveness detection system. This article explains how the codebase distinguishes between active opportunities and expired listings through a combination of network-level heuristics and DOM content analysis.
The Two-Stage Verification Architecture
The liveness verification system splits detection between browser automation and pure classification logic. This separation allows for both robust page interaction and deterministic rule evaluation without side effects.
Stage 1: Playwright Navigation and Data Extraction
In liveness-browser.mjs, the checkUrlLiveness function launches a headless Chromium instance to gather raw page telemetry:
- The final URL after all redirects (finalUrl)
- The raw inner text of the document body (bodyText)
- A filtered list of visible apply controls (applyControls)
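The extraction step can be pictured with a few standard Playwright page calls. The following is a minimal sketch rather than the project's code: the function name extractPageData, the APPLY_TEXT regex, the selector, and the timeout are all assumptions made for illustration. Only page.goto, page.url, page.innerText, and the locator methods are real Playwright APIs.

```javascript
// Hypothetical pattern for apply-style controls; the real project may use a different list.
const APPLY_TEXT = /\b(apply|submit application)\b/i;

// Collects the raw signals from an already-created Playwright page.
async function extractPageData(page, url, timeoutMs = 20000) {
  // goto resolves with the main-resource response, or null for some navigations.
  const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
  const status = response ? response.status() : null;

  const finalUrl = page.url();                   // URL after all redirects
  const bodyText = await page.innerText('body'); // rendered text of <body>

  // Keep only visible links/buttons whose label looks like an apply action.
  const applyControls = [];
  for (const control of await page.locator('a, button').all()) {
    if (!(await control.isVisible())) continue;
    const label = (await control.innerText()).trim();
    if (APPLY_TEXT.test(label)) applyControls.push(label);
  }

  return { status, finalUrl, bodyText, applyControls };
}
```

Returning plain data (strings and numbers, no page handles) is what lets the second stage stay free of browser side effects.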
Before navigation begins, the rejectPrivateOrInvalid guard blocks non-HTTP protocols, localhost, and private network addresses, returning an immediate invalid_url or blocked_host result for security.
Stage 2: Deterministic Classification
The classifyLiveness function in liveness-core.mjs receives the extracted data and applies a priority-ordered rule set. Each evaluation returns a structured object containing result (expired/active/uncertain), code, and reason.
The classification hierarchy evaluates conditions in this exact order:
1. HTTP Status Validation – Checks whether status is 404 or 410, returning expired with code http_gone.
2. Redirect URL Analysis – Matches finalUrl against EXPIRED_URL_PATTERNS (e.g., ?error=true), returning expired with code expired_url.
3. Body Text Heuristics – Searches bodyText for HARD_EXPIRED_PATTERNS like "job no longer available" or "position has been filled", returning expired with code expired_body.
4. Apply Control Detection – If hasApplyControl(applyControls) identifies visible application buttons or links, returns active with code apply_control_visible.
5. Listing Page Detection – Matches LISTING_PAGE_PATTERNS indicating search results pages (e.g., "12 jobs found"), returning expired with code listing_page.
6. Content Sufficiency Check – Validates bodyText length against MIN_CONTENT_CHARS (default 300 characters), returning expired with code insufficient_content if the content is too short (typical of navigation-only pages).
7. Fallback Classification – Returns uncertain with code no_apply_control if no preceding rules match.
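To make the pattern-driven rules concrete, here is a hedged sketch of what the lookup tables and the apply-control check might look like. The regexes are built only from the examples quoted in this article, the matchesAny helper is hypothetical, and the assumption that applyControls is an array of label strings is mine. The real definitions in liveness-core.mjs may differ.

```javascript
// Illustrative pattern tables built only from the examples quoted in this article;
// the real lists in liveness-core.mjs are likely longer and may differ.
const EXPIRED_URL_PATTERNS = [/[?&]error=true\b/i];
const HARD_EXPIRED_PATTERNS = [
  /job (is )?no longer available/i,
  /position has been filled/i,
];

// Hypothetical helper: true when any pattern in the list matches the text.
function matchesAny(patterns, text) {
  return patterns.some((pattern) => pattern.test(text || ''));
}

// Assumed shape: applyControls is an array of visible control labels (strings).
function hasApplyControl(applyControls) {
  return Array.isArray(applyControls) &&
    applyControls.some((label) => /\bapply\b/i.test(label));
}

// A page can trip two signals at once: stale templates often keep the button.
const bodyText = 'This position has been filled. Browse similar roles.';
const applyControls = ['Apply now'];
console.log(matchesAny(HARD_EXPIRED_PATTERNS, bodyText)); // true
console.log(hasApplyControl(applyControls));              // true
```

Because the body-text rule is evaluated before the apply-control rule, a page like the one in the example is classified as expired even though an apply button is still rendered.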
Integration with the Scan Workflow
When running node scan.mjs --verify, the verifyOffers helper orchestrates the verification batch. It iterates over discovered URLs, invokes checkUrlLiveness, and routes results into three categories:
- Active – Postings with visible apply controls proceed to the verified pipeline.
- Expired – URLs flagged as expired are written to scan-history.tsv with status skipped_expired.
- Uncertain – Results lacking apply controls are treated as dropped and logged as skipped_no_apply_control, while navigation errors remain uncertain for retry on subsequent scans.
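The routing described above can be sketched as a small loop. This is an illustration under stated assumptions, not the verifyOffers implementation: routeOffers and checkFn are hypothetical names, the shape of the skipped records is invented, and it is an assumption that navigation failures surface either as thrown errors or as uncertain results with some other code.

```javascript
// Hypothetical routing step; `checkFn` stands in for checkUrlLiveness bound to a page.
async function routeOffers(urls, checkFn) {
  const verified = [];   // active postings that continue through the pipeline
  const skipped = [];    // records to persist or log (record shape is assumed)
  const retryLater = []; // navigation problems, left for a subsequent scan

  for (const url of urls) {
    let outcome;
    try {
      outcome = await checkFn(url);
    } catch (err) {
      // Assumed behaviour: a failed navigation stays uncertain and is retried later.
      retryLater.push(url);
      continue;
    }

    if (outcome.result === 'active') {
      verified.push(url);
    } else if (outcome.result === 'expired') {
      skipped.push({ url, status: 'skipped_expired', code: outcome.code });
    } else if (outcome.code === 'no_apply_control') {
      skipped.push({ url, status: 'skipped_no_apply_control', code: outcome.code });
    } else {
      retryLater.push(url);
    }
  }

  return { verified, skipped, retryLater };
}
```

Keeping the checker injectable means the routing logic can be exercised with a stub, without launching a browser.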
Security and Reliability Guardrails
The rejectPrivateOrInvalid utility in liveness-browser.mjs prevents the browser from accessing internal infrastructure by validating URLs before page load. This ensures that private IP ranges and unsupported protocols never reach the classification stage, protecting both the scanning infrastructure and target systems.
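A guard of this kind can be approximated with the standard URL parser and a few range checks. The sketch below is not the project's rejectPrivateOrInvalid: the function names, the reason strings, the exact set of blocked ranges, and the convention of returning null for an allowed URL are assumptions. Only the codes invalid_url and blocked_host come from the description above.

```javascript
// Hypothetical helper: true for loopback, private (RFC 1918), and link-local IPv4 literals.
function isPrivateIPv4(host) {
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||                          // 10.0.0.0/8
    a === 127 ||                         // loopback
    (a === 169 && b === 254) ||          // link-local
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168)             // 192.168.0.0/16
  );
}

// Sketch of a pre-navigation guard. Returns null when the URL may be loaded (assumed convention).
function rejectPrivateOrInvalidSketch(rawUrl) {
  let parsed;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return { code: 'invalid_url', reason: 'URL could not be parsed' };
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return { code: 'invalid_url', reason: `unsupported protocol ${parsed.protocol}` };
  }
  const host = parsed.hostname.toLowerCase();
  if (host === 'localhost' || host.endsWith('.localhost') || host === '[::1]' || isPrivateIPv4(host)) {
    return { code: 'blocked_host', reason: `host ${host} is local or private` };
  }
  return null;
}
```

Note that a check like this inspects only the literal hostname. It does not resolve DNS, so a public-looking name that resolves to a private address would pass this sketch.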
Practical Usage Examples
To check a single URL programmatically:
import { chromium } from 'playwright';
import { checkUrlLiveness } from './liveness-browser.mjs';

async function validateJob(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // The check resolves to the structured { result, code, reason } object.
    const { result, code, reason } = await checkUrlLiveness(page, url);
    console.log(`${url} → ${result} (${code}): ${reason}`);
  } finally {
    // Close the browser even if navigation or classification throws.
    await browser.close();
  }
}

validateJob('https://example.com/job/123').catch(console.error);
The core classification logic follows this deterministic flow:
function classifyLiveness({ status, finalUrl, bodyText, applyControls }) {
  // Hard HTTP signals come first.
  if (status === 404 || status === 410)
    return { result: 'expired', code: 'http_gone' };
  // Then the URL the navigation ended on.
  if (EXPIRED_URL_PATTERNS.some(p => p.test(finalUrl)))
    return { result: 'expired', code: 'expired_url' };
  // Then explicit "expired" wording in the page text.
  if (HARD_EXPIRED_PATTERNS.some(p => p.test(bodyText)))
    return { result: 'expired', code: 'expired_body' };
  // Only now can a visible apply control mark the posting as active.
  if (applyControls.length > 0)
    return { result: 'active', code: 'apply_control_visible' };
  // Listing-page and content-length checks omitted for brevity
  return { result: 'uncertain', code: 'no_apply_control' };
}
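The checks that the simplified flow above omits could look roughly like the following. This is a sketch under assumptions: the function name classifyRemaining, the listing regex, the reason strings, and the choice to trim whitespace before measuring length are mine. The "12 jobs found" example, the result codes, and the 300-character default come from the article.

```javascript
// Illustrative values: the regex is an assumption modelled on the "12 jobs found" example.
const LISTING_PAGE_PATTERNS = [/\b\d+\s+jobs?\s+found\b/i];
const MIN_CONTENT_CHARS = 300;

// Hypothetical continuation of the ladder, reached only when no earlier rule matched.
function classifyRemaining({ bodyText = '' }) {
  // A results page means the posting URL no longer leads to a single job.
  if (LISTING_PAGE_PATTERNS.some((p) => p.test(bodyText))) {
    return { result: 'expired', code: 'listing_page', reason: 'page looks like a search results listing' };
  }
  // Very short pages usually contain navigation chrome and nothing else.
  if (bodyText.trim().length < MIN_CONTENT_CHARS) {
    return {
      result: 'expired',
      code: 'insufficient_content',
      reason: `body has fewer than ${MIN_CONTENT_CHARS} characters`,
    };
  }
  // Nothing conclusive either way.
  return { result: 'uncertain', code: 'no_apply_control', reason: 'no apply control and no expiry signal found' };
}

console.log(classifyRemaining({ bodyText: '12 jobs found for "engineer"' }).code); // listing_page
```

Testing for a listing page before measuring length matters in this sketch, because a short results page would otherwise be reported as insufficient_content, which says less about why the posting was dropped.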
Summary
- Playwright liveness verification in Career-Ops separates data collection from classification logic across liveness-browser.mjs and liveness-core.mjs.
- The system prioritizes early-exit checks for HTTP errors and expired URL patterns before analyzing page content.
- Visible apply controls serve as the primary signal for active postings, ensuring only actionable job links reach users.
- Security guardrails in rejectPrivateOrInvalid prevent scanning of private networks and invalid protocols.
- Integration via verifyOffers in scan.mjs provides automated batch processing with persistent history tracking.
Frequently Asked Questions
What triggers the "insufficient_content" classification?
When the extracted bodyText contains fewer than 300 characters (configurable via MIN_CONTENT_CHARS), the system classifies the posting as expired with code insufficient_content. This heuristic catches pages that render only navigation footers or skeleton loaders without actual job details.
How does the system handle redirects to expired job pages?
The checkUrlLiveness function captures the finalUrl after all redirects complete. If this final URL matches patterns defined in EXPIRED_URL_PATTERNS (such as query parameters containing error=true), the classifier immediately returns expired_url before inspecting the page body.
Can the liveness checker distinguish between a filled position and a removed posting?
Not through the result or code. The HARD_EXPIRED_PATTERNS array includes phrases for both cases, such as "position has been filled" and "job no longer available", but a match on either one yields the same expired classification with the same code, expired_body. Any finer distinction would have to come from the reason field that accompanies each result, which is kept for audit purposes.
Why are some URLs classified as "uncertain" rather than expired?
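URLs receive uncertain status when they load successfully but lack visible apply controls and do not match any hard expiration, listing-page, or short-content rule. These are ambiguous cases, such as JavaScript-heavy pages whose content renders late, where marking the posting expired would risk discarding a live job. In the scan workflow, results with code no_apply_control are dropped and logged as skipped_no_apply_control, while navigation errors stay uncertain and are retried on subsequent scans.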