How the liveness-browser.mjs Classifier Distinguishes Expired from Active Job Postings
The liveness-browser.mjs classifier orchestrates Playwright to scrape job posting pages and delegates to liveness-core.mjs, which evaluates HTTP status codes, URL redirect patterns, body text regexes, apply button visibility, and content length thresholds to return a deterministic expired, active, or uncertain verdict.
The liveness-browser.mjs classifier in the santifer/career-ops repository automates the verification of job posting URLs by simulating real browser sessions. Understanding the liveness-browser.mjs classifier logic is essential for anyone building job board aggregators or ATS monitoring tools that need to filter out stale listings. This article breaks down the exact signals and decision rules used to categorize postings.
Overview of the Classification Pipeline
The classifier operates as a two-stage process. First, liveness-browser.mjs handles browser orchestration, safety checks, and data extraction. Then, it passes structured data to classifyLiveness() in liveness-core.mjs to execute the rule-based evaluation.
The pipeline extracts four critical data points from each page:
- HTTP response status captured during navigation
- Final URL after resolving all redirects
- Body text content for pattern matching
- Visible apply controls (buttons, links, or inputs that are not hidden and not inside navigation or footer elements)
The Six Signal Detection Rules
According to the source code in liveness-core.mjs, the classifier evaluates six distinct signals in priority order. Each signal triggers a specific result code when matched.
HTTP Status Code Validation
The classifier first inspects the HTTP response status. If the server returns 404 (Not Found) or 410 (Gone), the posting is immediately marked as expired.
Result: { result: 'expired', code: 'http_gone' }
Expired URL Pattern Matching
When job platforms remove listings, they often redirect to generic URLs containing fragments like "expired" or "closed". The classifier checks the final URL against EXPIRED_URL_PATTERNS.
Result: { result: 'expired', code: 'expired_url' }
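A minimal sketch of this check, assuming the patterns are plain regular expressions tested against the final URL. The pattern list and the name `matchesExpiredUrl` are hypothetical; the actual contents of EXPIRED_URL_PATTERNS are not reproduced in this article.

```javascript
// Hypothetical stand-in for EXPIRED_URL_PATTERNS; the real list may differ.
const SKETCH_EXPIRED_URL_PATTERNS = [/\b(expired|closed)\b/i];

// Returns true when the post-redirect URL contains an expiry marker.
function matchesExpiredUrl(finalUrl) {
  return SKETCH_EXPIRED_URL_PATTERNS.some((pattern) => pattern.test(finalUrl));
}

console.log(matchesExpiredUrl('https://jobs.example.com/jobs?status=expired'));
console.log(matchesExpiredUrl('https://jobs.example.com/role-software-engineer-12345'));
```

The word boundaries keep a slug such as "disclosed-roles" from matching "closed" by accident, which is one reason to prefer anchored patterns over plain substring checks.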
Hard Expired Text Patterns
The classifier scans bodyText for definitive expiration phrases defined in HARD_EXPIRED_PATTERNS. These are hard-coded patterns for wording that states outright that the position is no longer available.
Result: { result: 'expired', code: 'expired_body' }
Apply Control Visibility
For active postings, the classifier looks for visible application elements. It filters the DOM for buttons, links, or inputs that:
- Are not hidden via CSS
- Have positive dimensions (width and height)
- Are not contained within <nav> or <footer> elements
If any collected applyControls match APPLY_PATTERNS, the posting is considered active.
Result: { result: 'active', code: 'apply_control_visible' }
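The filtering described above can be sketched on plain objects, without a DOM. The field names (`hidden`, `width`, `height`, `insideNavOrFooter`) and the pattern list are assumptions made for this illustration; the real module queries live elements and uses its own APPLY_PATTERNS.

```javascript
// Hypothetical stand-in for APPLY_PATTERNS; the real list may differ.
const SKETCH_APPLY_PATTERNS = [/\bapply\b/i];

// Each candidate is a plain object describing one button, link, or input.
// Keeps only elements that are rendered, outside nav/footer, and apply-like.
function visibleApplyControls(candidates) {
  return candidates
    .filter((el) => !el.hidden)                     // not hidden via CSS
    .filter((el) => el.width > 0 && el.height > 0)  // positive dimensions
    .filter((el) => !el.insideNavOrFooter)          // not in <nav> or <footer>
    .filter((el) => SKETCH_APPLY_PATTERNS.some((p) => p.test(el.text)));
}

const candidates = [
  { tag: 'button', text: 'Apply Now', hidden: false, width: 120, height: 40, insideNavOrFooter: false },
  { tag: 'a', text: 'Apply', hidden: false, width: 60, height: 20, insideNavOrFooter: true },
  { tag: 'button', text: 'Apply', hidden: true, width: 120, height: 40, insideNavOrFooter: false },
];
console.log(visibleApplyControls(candidates).map((el) => el.text));
```

Excluding navigation and footer containers matters because many career sites keep a generic "Apply" or "Careers" link in the page chrome even after a specific posting has been removed.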
Listing Page Detection
Some Applicant Tracking Systems redirect removed postings to generic search results pages rather than showing a 404. The classifier identifies these by matching the final URL against LISTING_PAGE_PATTERNS.
Result: { result: 'expired', code: 'listing_page' }
Content Length Validation
If bodyText contains fewer than MIN_CONTENT_CHARS (300 characters), the page is likely just a navigation skeleton or error template without actual job content.
Result: { result: 'expired', code: 'insufficient_content' }
Fallback for Uncertain States
When none of the above signals fires, meaning no apply controls are visible and no expiration signals are detected, the classifier reports uncertainty rather than guessing.
Result: { result: 'uncertain', code: 'no_apply_control' }
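Taken together, the rules form a first-match-wins chain. The sketch below follows the order and result codes listed in this article, but every pattern list is a hypothetical placeholder and `classifySketch` is not the real classifyLiveness(); only the 404/410 check, the 300-character threshold, and the codes come from the text above.

```javascript
// Hypothetical pattern lists for illustration only.
const PATTERNS = {
  expiredUrl: [/\b(expired|closed)\b/i],
  hardExpired: [/no longer (available|accepting applications)/i],
  apply: [/\bapply\b/i],
  listingPage: [/\/(jobs|careers)\/?(\?.*)?$/i],
};
const MIN_CONTENT_CHARS = 300; // threshold stated in the article

const matchesAny = (patterns, text) => patterns.some((p) => p.test(text));

// Evaluates the signals in priority order and returns on the first match.
function classifySketch({ status, finalUrl, bodyText, applyControls }) {
  if (status === 404 || status === 410) return { result: 'expired', code: 'http_gone' };
  if (matchesAny(PATTERNS.expiredUrl, finalUrl)) return { result: 'expired', code: 'expired_url' };
  if (matchesAny(PATTERNS.hardExpired, bodyText)) return { result: 'expired', code: 'expired_body' };
  if (applyControls.some((c) => matchesAny(PATTERNS.apply, c.text))) {
    return { result: 'active', code: 'apply_control_visible' };
  }
  if (matchesAny(PATTERNS.listingPage, finalUrl)) return { result: 'expired', code: 'listing_page' };
  if (bodyText.length < MIN_CONTENT_CHARS) return { result: 'expired', code: 'insufficient_content' };
  return { result: 'uncertain', code: 'no_apply_control' };
}
```

Because the expiration checks on status, URL, and body text run before the apply-control check, a page that says the role is no longer available is reported as expired even if a leftover apply button is still rendered.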
Implementation in liveness-browser.mjs
The liveness-browser.mjs module manages the browser lifecycle and data preparation before classification.
Safety Guards
Before navigation, the rejectPrivateOrInvalid() function blocks non-HTTP(S) protocols and private network hosts (localhost, 127.0.0.1, 192.168.x.x, etc.), preventing the scanner from hitting internal infrastructure.
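A guard of this kind might look roughly like the following. The function name `unsafeUrlReason`, its return convention (a reason string, or null when the URL is acceptable), and the exact address ranges are assumptions for illustration; the real rejectPrivateOrInvalid() may behave differently. The sketch inspects only the literal hostname and does not resolve DNS, so a public name that points at a private address would pass.

```javascript
// Hypothetical guard: returns a rejection reason, or null if the URL looks safe.
function unsafeUrlReason(rawUrl) {
  let parsed;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return 'invalid URL';
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return 'unsupported protocol';
  }
  const host = parsed.hostname.toLowerCase();
  if (host === 'localhost' || host.endsWith('.localhost') || host === '[::1]') {
    return 'private host';
  }
  // Literal IPv4 addresses: loopback plus the RFC 1918 private ranges.
  const ipv4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (ipv4) {
    const a = Number(ipv4[1]);
    const b = Number(ipv4[2]);
    const isPrivate =
      a === 127 || a === 10 || (a === 192 && b === 168) || (a === 172 && b >= 16 && b <= 31);
    if (isPrivate) return 'private host';
  }
  return null;
}

console.log(unsafeUrlReason('http://192.168.1.10/admin'));
console.log(unsafeUrlReason('https://jobs.example.com/role-software-engineer-12345'));
```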
Navigation Strategy
The module uses Playwright to load pages with a 15-second navigation timeout, then waits an additional 2 seconds to allow for Single Page Application hydration. The fixed delay gives JavaScript-rendered content time to appear before extraction, although unusually slow pages may still be incomplete when it elapses.
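The two timing values can be expressed as a small helper around Playwright's `page.goto()` and `page.waitForTimeout()`. The helper name `navigateAndSettle` and the `waitUntil: 'domcontentloaded'` option are assumptions; the article states only the 15-second timeout and the 2-second wait.

```javascript
const NAVIGATION_TIMEOUT_MS = 15000; // navigation timeout from the article
const HYDRATION_WAIT_MS = 2000;      // post-navigation wait from the article

// `page` is expected to be a Playwright Page object.
// Returns the navigation response so the caller can read its status.
async function navigateAndSettle(page, url) {
  const response = await page.goto(url, {
    timeout: NAVIGATION_TIMEOUT_MS,
    waitUntil: 'domcontentloaded', // assumption: the real module may wait for a different event
  });
  // Fixed pause so client-side frameworks can render the job content.
  await page.waitForTimeout(HYDRATION_WAIT_MS);
  return response;
}
```

A fixed pause is simple and predictable. Waiting for a specific selector would usually be faster, but it needs per-site knowledge that a generic scanner does not have.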
Data Extraction
After navigation completes, the script captures:
- response.status() for HTTP code analysis
- page.url() for redirect tracking
- bodyText via page content extraction
- applyControls via DOM querying that excludes hidden elements and navigation or footer containers
All collected data is packaged into an object and passed to classifyLiveness().
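As a rough illustration of that packaging step, the helper below assembles the four fields into the object shape that classifyLiveness() receives elsewhere in this article. The name `buildSnapshot`, the `collectApplyControls` callback, and the fallback status of 0 when no response is available are all hypothetical.

```javascript
// `response` and `page` are expected to be Playwright Response and Page objects.
// `collectApplyControls` is a hypothetical callback resolving to [{ text, tag }].
async function buildSnapshot(response, page, collectApplyControls) {
  return {
    // page.goto() can resolve to null, so guard before reading the status.
    status: response ? response.status() : 0,
    finalUrl: page.url(),
    bodyText: await page.locator('body').innerText(),
    applyControls: await collectApplyControls(page),
  };
}
```

Keeping extraction separate from classification is what lets the core rules run on pre-scraped data without a browser.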
Using the Classifier in Your Code
You can integrate the classifier into existing pipelines using either the high-level browser wrapper or the core logic directly.
Checking a URL with Browser Automation
Use checkUrlLiveness() from liveness-browser.mjs when you need full browser rendering:
import { chromium } from 'playwright';
import { checkUrlLiveness } from './liveness-browser.mjs';

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const url = 'https://jobs.example.com/role-software-engineer-12345';
  const { result, code, reason } = await checkUrlLiveness(page, url);
  console.log(`Liveness: ${result} (code: ${code}) – ${reason}`);
  await browser.close();
})();
Direct Core Classification
For unit testing or server-side rendering scenarios where you already have page data, import classifyLiveness() from liveness-core.mjs:
import { classifyLiveness } from './liveness-core.mjs';

const sample = {
  status: 200,
  finalUrl: 'https://jobs.example.com/role-software-engineer-12345',
  bodyText: '...apply now...',
  applyControls: [{ text: 'Apply Now', tag: 'button' }]
};

const { result, code, reason } = classifyLiveness(sample);
console.log(result, code, reason);
// Expected values: "active", "apply_control_visible", "visible apply control detected"
Summary
- The liveness-browser.mjs classifier uses Playwright to gather page data and delegates evaluation to liveness-core.mjs.
- Six primary signals determine status: HTTP status codes (404/410), expired URL patterns, hard-coded expiration phrases, visible apply controls, listing page redirects, and minimum content length (300 characters).
- Safety measures include private network blocking via rejectPrivateOrInvalid() and SPA hydration waits (15s navigation + 2s wait).
- Return values are standardized objects containing result (expired, active, or uncertain), code (specific trigger identifier), and reason (human-readable explanation).
- Integration options include high-level checkUrlLiveness() for live URLs or direct classifyLiveness() for testing with static data.
Frequently Asked Questions
How does liveness-browser.mjs handle Single Page Applications?
The classifier accommodates SPAs by waiting 2 seconds after the initial navigation completes (which uses a 15-second timeout). This hydration delay allows JavaScript frameworks to render job content and apply buttons that would not exist in the initial HTML payload.
What is the minimum content length threshold, and why?
The classifier requires bodyText to contain at least MIN_CONTENT_CHARS (300 characters). Pages shorter than this threshold typically represent navigation shells, error pages, or generic redirects without actual job descriptions, triggering the insufficient_content expiration code.
Can I use the classifier without running a full browser?
Yes. While liveness-browser.mjs requires Playwright for live URL checking, you can import classifyLiveness() directly from liveness-core.mjs to classify pre-scraped data. This is useful for unit testing (as demonstrated in test-all.mjs) or when integrating with existing crawling infrastructure that already provides page content, status codes, and final URLs.
Why does the classifier return "uncertain" instead of defaulting to active or expired?
The uncertain result with code no_apply_control acts as a safety mechanism when no apply buttons are detected but no explicit expiration signals are present. This avoids misclassifying pages with complex authentication requirements, heavy JavaScript that obscures controls, or unconventional ATS layouts that hide apply elements until interaction. Such pages are flagged for manual review instead of being categorized automatically.
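One way a calling pipeline might act on this is to group results by verdict and send the uncertain bucket to a review queue. The helper `partitionByVerdict` and the `{ url, result, code }` record shape are hypothetical and are not part of the repository.

```javascript
// `checks` is a hypothetical array of { url, result, code } records.
// Groups records by verdict; anything unrecognized is treated as uncertain.
function partitionByVerdict(checks) {
  const buckets = { active: [], expired: [], uncertain: [] };
  for (const check of checks) {
    const known = Object.prototype.hasOwnProperty.call(buckets, check.result);
    buckets[known ? check.result : 'uncertain'].push(check);
  }
  return buckets;
}

const buckets = partitionByVerdict([
  { url: 'https://jobs.example.com/a', result: 'active', code: 'apply_control_visible' },
  { url: 'https://jobs.example.com/b', result: 'uncertain', code: 'no_apply_control' },
]);
console.log(buckets.uncertain.length);
```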