How Playwright-Based Liveness Verification Classifies Job Postings as Expired or Active
Career-Ops uses a deterministic, rule-based pipeline that first checks HTTP status and URL redirects, then analyzes page content and visible apply controls to classify job postings as expired, active, or uncertain.
The santifer/career-ops repository automates job search workflows by filtering dead listings before they reach users. Its Playwright-based liveness verification system combines browser automation with content heuristics to determine whether a posting is still accepting applications or has been removed.
The Two-Stage Verification Pipeline
The verification process separates network-level data collection from content-level decision making. This architecture ensures that expensive browser operations happen once, while classification rules run cheaply against extracted data.
Stage 1: Browser Navigation and Data Extraction
The checkUrlLiveness function in liveness-browser.mjs orchestrates the Playwright session. It drives a headless Chromium page, navigates to the target URL, and waits for the DOM to load. Alongside the HTTP response status, the function captures three data points:
- finalUrl: The URL after all redirects resolve.
- bodyText: The raw inner text of the page body for keyword analysis.
- applyControls: An array of visible interactive elements that allow users to submit applications.
These values are packaged into a plain object and passed to the classifier.
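The extraction step can be sketched as follows, using Playwright's page.goto, page.url, and locator APIs. The helper name extractPageData, the waitUntil setting, the selector, the visibility test, and the choice to represent each control by its label text are all illustrative assumptions, not the repository's actual code:

```javascript
// Hypothetical helper: gathers the inputs the classifier needs from a Playwright page.
async function extractPageData(page, url) {
  // page.goto resolves to the main-resource Response, or null in some navigations.
  const response = await page.goto(url, { waitUntil: 'domcontentloaded' });
  const status = response ? response.status() : 0;

  // page.url() reflects the address after any redirects.
  const finalUrl = page.url();
  const bodyText = await page.locator('body').innerText();

  // Assumed selector and visibility test; the real project may detect controls differently.
  const applyControls = await page
    .locator('a, button, input[type="submit"]')
    .evaluateAll((elements) =>
      elements
        .filter((el) => el.getClientRects().length > 0) // skip elements with no rendered box
        .map((el) => (el.innerText || el.value || '').trim())
        .filter((label) => label.length > 0)
    );

  return { status, finalUrl, bodyText, applyControls };
}
```

Keeping the return value a plain object means the classifier can be exercised in unit tests without a browser.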
Stage 2: Rule-Based Classification
The classifyLiveness function in liveness-core.mjs receives the extracted data and applies a strictly ordered series of checks. Each check returns immediately upon matching, producing a result object with three properties: result (expired, active, or uncertain), code (a machine-readable reason), and reason (a human-readable explanation).
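To illustrate the "first match wins" shape of the result object, here is a minimal sketch built around a rule table. The names RULES_SKETCH and firstMatch, the two sample rules, and the reason strings are assumptions made for illustration; only the result and code values come from this article:

```javascript
// Hypothetical rule table: each rule returns a result object or null.
const RULES_SKETCH = [
  ({ status }) =>
    status === 404 || status === 410
      ? { result: 'expired', code: 'http_gone', reason: `HTTP ${status}` }
      : null,
  ({ applyControls }) =>
    applyControls.length > 0
      ? { result: 'active', code: 'apply_control_visible', reason: 'Apply control is visible' }
      : null,
];

function firstMatch(rules, input) {
  for (const rule of rules) {
    const outcome = rule(input);
    if (outcome) return outcome; // first match wins; later rules never run
  }
  // Nothing matched: fall back to the uncertain outcome.
  return { result: 'uncertain', code: 'no_apply_control', reason: 'No rule matched' };
}
```

Because the rules run in array order, a 410 response is reported as expired even when the page still renders an apply button.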
Classification Rules: From Expired to Active
The classifier evaluates conditions in priority order, allowing hard failures to short-circuit before UI analysis begins.
HTTP-Level Failures
http_gone: If the response status is 404 or 410, the posting is immediately marked expired.
Redirect Analysis
expired_url: If finalUrl matches any pattern in EXPIRED_URL_PATTERNS (such as ?error=true or common job-board error paths), the posting is expired.
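A small sketch of this redirect check, assuming the patterns are regular expressions. The single pattern below covers only the ?error=true case mentioned above; the constant and function names are hypothetical, and the repository's actual EXPIRED_URL_PATTERNS list may look different:

```javascript
// Illustrative pattern list only; not copied from the repository.
const EXAMPLE_EXPIRED_URL_PATTERNS = [
  /[?&]error=true(?:&|#|$)/i, // the "?error=true" query flag
];

function matchesExpiredUrl(finalUrl, patterns = EXAMPLE_EXPIRED_URL_PATTERNS) {
  return patterns.some((pattern) => pattern.test(finalUrl));
}
```

The patterns are written without the g flag because RegExp.prototype.test on a global regex keeps lastIndex state between calls, which would make repeated checks unreliable.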
Content Heuristics
- expired_body: If bodyText matches HARD_EXPIRED_PATTERNS containing phrases like "job no longer available" or "position has been filled", the result is expired.
- listing_page: If the content matches LISTING_PAGE_PATTERNS (e.g., "12 jobs found"), the posting is expired because the specific job redirected to a generic search results page.
- insufficient_content: If bodyText contains fewer than MIN_CONTENT_CHARS (default 300) characters, the posting is expired as the page likely contains only navigation footers or error shells.
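The three content rules can be sketched as one function that returns the matching code. The regular expressions are guesses derived from the example phrases above, the trimming step is an assumption, and every name here is hypothetical rather than taken from liveness-core.mjs:

```javascript
const MIN_CONTENT_CHARS_EXAMPLE = 300; // default stated in this article

// Illustrative patterns built from the phrases quoted above.
const EXAMPLE_HARD_EXPIRED = [/job (is )?no longer available/i, /position has been filled/i];
const EXAMPLE_LISTING_PAGE = [/\b\d+\s+jobs?\s+found\b/i];

function classifyBodySketch(bodyText) {
  const text = bodyText.trim();
  if (EXAMPLE_HARD_EXPIRED.some((p) => p.test(text))) return 'expired_body';
  if (EXAMPLE_LISTING_PAGE.some((p) => p.test(text))) return 'listing_page';
  if (text.length < MIN_CONTENT_CHARS_EXAMPLE) return 'insufficient_content';
  return null; // no content-level expiry signal
}
```

Checking the explicit phrases before the length threshold keeps the reported code specific: a short page that says the position has been filled is labeled expired_body rather than insufficient_content.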
Active Indicators
apply_control_visible: If hasApplyControl(applyControls) detects visible apply buttons or links, the posting is marked active and all further checks are skipped.
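A rough sketch of what an apply-control check could look like, assuming each entry in applyControls is the visible label text of a button or link. That representation, the pattern, and the name hasApplyControlSketch are assumptions; the repository's hasApplyControl may inspect richer data:

```javascript
// Hypothetical label pattern: matches "Apply", "Apply now", "Quick apply", and similar.
const APPLY_LABEL_EXAMPLE = /\bapply\b/i;

function hasApplyControlSketch(applyControls) {
  return applyControls.some((label) => APPLY_LABEL_EXAMPLE.test(label));
}
```

The word boundaries keep labels such as "Applied" from counting as an apply control.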
Fallback
no_apply_control: If none of the above conditions match, the posting is uncertain and flagged for manual review or retry.
Integration with the Scan Workflow
When users run node scan.mjs --verify, the verifyOffers helper iterates through discovered URLs and invokes the two-stage pipeline. Results are bucketed into three distinct actions:
- Active: Postings classified as active proceed to the verified pipeline and are shown to users.
- Expired: Postings with result: 'expired' are written to scan-history.tsv with status skipped_expired and removed from the queue.
- Uncertain: Postings with result: 'uncertain' and no apply control are treated as dropped and logged as skipped_no_apply_control. Those failing due to navigation errors remain uncertain and are retried on subsequent scans.
This guarantees that only URLs exposing a genuine apply mechanism survive to the user-facing output.
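The bucketing described above can be sketched as a pure mapping from a classification result to a scan action. The function name and the keep, retry, and historyStatus fields are hypothetical; only the result, code, and status strings come from this article:

```javascript
// Hypothetical mapping from a classifier verdict to what the scan does next.
function bucketForSketch({ result, code }) {
  if (result === 'active') {
    return { keep: true, retry: false, historyStatus: null };
  }
  if (result === 'expired') {
    return { keep: false, retry: false, historyStatus: 'skipped_expired' };
  }
  if (code === 'no_apply_control') {
    // Page loaded, but nothing on it lets a user apply.
    return { keep: false, retry: false, historyStatus: 'skipped_no_apply_control' };
  }
  // Any other uncertain outcome (for example a navigation error) is retried later.
  return { keep: false, retry: true, historyStatus: null };
}
```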
Safety Guards and Input Validation
Before Playwright consumes resources, rejectPrivateOrInvalid in liveness-browser.mjs blocks potentially dangerous or invalid inputs. It returns a guard result for:
- Non-HTTP protocols (unsupported_protocol).
- Localhost or private network addresses (blocked_host).
- Malformed URLs (invalid_url).
Guarded URLs never reach the classification stage, preventing the browser from attempting to load internal resources or unsupported schemes.
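A guard of this kind might be sketched as below, using the standard URL constructor. The function name, the host patterns, and the shape of the returned object are assumptions; only the three codes are taken from this article, and the real rejectPrivateOrInvalid may cover more cases:

```javascript
// Illustrative host patterns for loopback, private, and link-local ranges.
const PRIVATE_HOST_EXAMPLES = [
  /^localhost$/i,
  /^127\./,
  /^10\./,
  /^192\.168\./,
  /^172\.(1[6-9]|2\d|3[01])\./,
  /^169\.254\./,
  /^\[::1\]$/, // URL.hostname keeps the brackets around IPv6 literals
];

// Returns a guard object with a code when the URL must not be loaded, otherwise null.
function guardUrlSketch(rawUrl) {
  let parsed;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return { code: 'invalid_url' };
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return { code: 'unsupported_protocol' };
  }
  if (PRIVATE_HOST_EXAMPLES.some((pattern) => pattern.test(parsed.hostname))) {
    return { code: 'blocked_host' };
  }
  return null;
}
```

A string-based check like this cannot catch public hostnames that resolve to private addresses, so it should be read as a first filter rather than a complete defense.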
Practical Code Examples
You can manually verify a single URL using the checkUrlLiveness driver:
import { chromium } from 'playwright';
import { checkUrlLiveness } from './liveness-browser.mjs';

async function demo(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // checkUrlLiveness navigates the page and returns the classifier's verdict.
    const { result, code, reason } = await checkUrlLiveness(page, url);
    console.log(`${url} → ${result} (${code}): ${reason}`);
  } finally {
    // Close the browser even if navigation or classification throws.
    await browser.close();
  }
}

demo('https://example.com/job/123').catch(console.error);
The core classification logic inside classifyLiveness follows this simplified structure:
// Core classification snippet (simplified; the reason field is omitted)
function classifyLiveness({ status, finalUrl, bodyText, applyControls }) {
  // Both 404 and 410 mean the posting is gone.
  if (status === 404 || status === 410) return { result: 'expired', code: 'http_gone' };
  if (EXPIRED_URL_PATTERNS.some((p) => p.test(finalUrl))) return { result: 'expired', code: 'expired_url' };
  if (HARD_EXPIRED_PATTERNS.some((p) => p.test(bodyText))) return { result: 'expired', code: 'expired_body' };
  if (applyControls.length) return { result: 'active', code: 'apply_control_visible' };
  // …fallback checks omitted for brevity
}
Summary
- Playwright-based liveness verification in santifer/career-ops extracts page data via checkUrlLiveness in liveness-browser.mjs before applying deterministic rules in classifyLiveness from liveness-core.mjs.
- The system checks HTTP status, redirect patterns, expired-content keywords, and listing-page indicators before confirming activity through visible apply controls.
- Results are bucketed into active, expired, or uncertain outcomes, with only active postings proceeding to the user-facing pipeline.
- Safety guards in rejectPrivateOrInvalid prevent the browser from accessing private networks or invalid protocols.
Frequently Asked Questions
What does "uncertain" mean in Playwright-based liveness verification?
An uncertain classification indicates that the crawler loaded the page successfully but found none of the definitive markers for either expiration or activity. This typically occurs when the page lacks an apply button but also lacks explicit "job filled" language or HTTP errors. These URLs are logged as skipped_no_apply_control and excluded from results unless they show an apply control on retry.
How does Career-Ops prevent false positives when a job is actually active?
The classifier treats a visible apply control as authoritative evidence that a posting is live, but only after the expiry rules have run without matching. If none of them match and hasApplyControl finds a visible apply button or link in the applyControls array, classifyLiveness returns active with code apply_control_visible and skips the remaining checks. A page without an apply control is never reported as active; it falls through to uncertain instead.
Can I run liveness checks without triggering a full scan?
Yes. The repository includes check-liveness.mjs, a thin CLI wrapper around checkUrlLiveness. You can invoke this script directly against specific URLs to debug classification results or verify individual postings without executing the full scan.mjs pipeline with the --verify flag.
Why does the system use 300 characters as the minimum content threshold?
The MIN_CONTENT_CHARS constant (default 300) filters out pages that render only navigation headers, footers, or generic error shells. When a job posting is removed, some sites return a valid HTTP 200 status with nearly empty HTML. By requiring at least 300 characters of body text, the insufficient_content rule catches these hollow responses and marks them expired before they reach the user.