How scan.mjs Handles Deduplication Across pipeline.md, applications.md, and scan-history.tsv
scan.mjs prevents duplicate job offers by building two in-memory sets at startup: one holding every URL found in scan-history.tsv, pipeline.md, and applications.md, and another holding normalized company-role pairs from applications.md. New offers are filtered against both sets before anything is written to the pipeline.
The santifer/career-ops repository automates job search tracking by scanning multiple providers and aggregating results into markdown files. To keep scans idempotent and prevent the same position from appearing more than once across your pipeline, application history, and scan archive, the scan.mjs script implements a two-layer deduplication system that checks both URLs and normalized company-role pairs.
The Two-Layer Deduplication Architecture
The deduplication strategy relies on two distinct in-memory Set objects constructed at runtime. This dual approach catches exact duplicates by URL while also preventing the same role at the same company from re-entering the pipeline under a different link, as implemented in the santifer/career-ops source code.
Building the Deduplication Sets
Before processing any new offers, the script loads historical state into memory.
Loading Seen URLs from Three Sources
The loadSeenUrls function (defined at line 78 in scan.mjs) populates a Set containing every URL previously processed. According to the source code, it harvests URLs from three distinct files:
- data/scan-history.tsv: The historical archive recording every job URL ever scanned
- data/pipeline.md: Currently pending offers awaiting action
- data/applications.md: Previously submitted applications
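As a rough illustration of the URL-harvesting idea, the sketch below collects every http(s) URL from the text of three files into one Set. It is not the repository's code: the helper names extractUrls and buildSeenUrls, the regex, and the sample file contents are all assumptions, and the real function reads the files from disk rather than taking strings.

```javascript
// Hypothetical helper: pull every http(s) URL out of a blob of text.
// The regex stops at whitespace and at common markdown/table delimiters.
function extractUrls(text) {
  const matches = text.match(/https?:\/\/[^\s|)>\]]+/g);
  return matches ?? [];
}

// Hypothetical stand-in for loadSeenUrls: merge the URLs found in the
// contents of all tracking files into a single Set.
function buildSeenUrls(fileContents) {
  const seen = new Set();
  for (const text of fileContents) {
    for (const url of extractUrls(text)) {
      seen.add(url);
    }
  }
  return seen;
}

// Invented sample contents; the real file layouts may differ.
const scanHistory = 'https://example.com/jobs/1\t2024-01-01\tAcme\n';
const pipeline = '- [ ] https://example.com/jobs/2 | Acme | Engineer\n';
const applications = '| 1 | Acme | Engineer | [link](https://example.com/jobs/3) |\n';

const seenUrls = buildSeenUrls([scanHistory, pipeline, applications]);
console.log(seenUrls.size); // 3
```

Using a Set keeps each later membership check constant-time, no matter how large the scan history grows.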
Extracting Company-Role Pairs from applications.md
The loadSeenCompanyRoles function (line 109) creates a second Set tracking normalized company::role strings extracted exclusively from the markdown table in data/applications.md. The normalization logic concatenates lowercase values: company.toLowerCase() + '::' + title.toLowerCase().
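A minimal sketch of that normalization applied to a markdown table might look like the following. The column order (index, company, role, status), the helper names companyRoleKey and buildSeenCompanyRoles, and the sample rows are assumptions for illustration; only the lowercase company::role key format comes from the description above.

```javascript
// Build the "company::role" key using the lowercase concatenation described above.
function companyRoleKey(company, title) {
  return company.toLowerCase() + '::' + title.toLowerCase();
}

// Hypothetical stand-in for loadSeenCompanyRoles.
// Assumed table layout: | # | Company | Role | ... |
function buildSeenCompanyRoles(markdown) {
  const seen = new Set();
  for (const line of markdown.split('\n')) {
    if (!line.trim().startsWith('|')) continue; // not a table row
    const cells = line.split('|').slice(1, -1).map((c) => c.trim());
    if (cells.length < 3) continue;
    // Skip the header row and the |---|---| separator row.
    if (cells[0] === '#' || /^:?-+:?$/.test(cells[0])) continue;
    seen.add(companyRoleKey(cells[1], cells[2]));
  }
  return seen;
}

// Invented sample table.
const sample = [
  '| # | Company | Role | Status |',
  '|---|---------|------|--------|',
  '| 1 | Acme | Senior Engineer | Applied |',
  '| 2 | Globex | Data Analyst | Rejected |',
].join('\n');

const seenCompanyRoles = buildSeenCompanyRoles(sample);
console.log(seenCompanyRoles.has(companyRoleKey('ACME', 'senior engineer'))); // true
```

Because the key is case-insensitive but otherwise an exact string match, titles that differ in wording (for example "Sr. Engineer" versus "Senior Engineer") would still count as different roles in this sketch.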
Runtime Filtering Logic
During the main scan loop, each job offer returned by a provider passes through three sequential filters: title validation, location validation, and deduplication.
The deduplication step performs two specific checks:
- URL Existence Check (lines 71-73): The script verifies if seenUrls.has(job.url) returns true. If the URL exists in any of the three source files, the offer is skipped immediately.
- Company-Role Pair Check (lines 75-79): For offers passing the first filter, the script constructs the normalized key and checks against seenCompanyRoles.has(key). This prevents the same position from re-entering via a different job board URL.
- Intra-Scan Deduplication: Offers surviving both checks are immediately added to the in-memory sets via seenUrls.add(job.url) and seenCompanyRoles.add(key), ensuring duplicates appearing later in the same scan batch are also filtered.
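The three steps above can be sketched as a single loop. This is an illustration under assumptions, not the script's actual code: the function name filterNewOffers, the job object shape (url, company, title), and the sample data are hypothetical, and the title and location filters are left out.

```javascript
// Same lowercase "company::role" key format described earlier.
function companyRoleKey(company, title) {
  return company.toLowerCase() + '::' + title.toLowerCase();
}

// Hypothetical filter: returns only offers that pass both dedup checks,
// and records each survivor so later items in the same batch are caught.
function filterNewOffers(jobs, seenUrls, seenCompanyRoles) {
  const fresh = [];
  for (const job of jobs) {
    // Check 1: exact URL already known.
    if (seenUrls.has(job.url)) continue;
    // Check 2: same company and role already known under another URL.
    const key = companyRoleKey(job.company, job.title);
    if (seenCompanyRoles.has(key)) continue;
    // Intra-scan protection: remember this offer immediately.
    seenUrls.add(job.url);
    seenCompanyRoles.add(key);
    fresh.push(job);
  }
  return fresh;
}

// Invented state and batch.
const knownUrls = new Set(['https://example.com/jobs/1']);
const knownPairs = new Set(['acme::engineer']);
const batch = [
  { url: 'https://example.com/jobs/1', company: 'Initech', title: 'Designer' }, // known URL
  { url: 'https://boards.example.org/a', company: 'Acme', title: 'Engineer' },  // known pair
  { url: 'https://example.com/jobs/9', company: 'Globex', title: 'Analyst' },   // new
  { url: 'https://other.example.net/9', company: 'globex', title: 'ANALYST' },  // repeat within batch
];

const fresh = filterNewOffers(batch, knownUrls, knownPairs);
console.log(fresh.map((j) => j.url)); // [ 'https://example.com/jobs/9' ]
```

Mutating the sets inside the loop is what makes the fourth offer drop out: by the time it is examined, the third offer's key is already recorded.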
Persisting Filtered Results
After filtering completes, only verified unique offers reach the persistence layer. The appendToPipeline function (lines 108-110) writes these to data/pipeline.md, while appendToScanHistory (lines 58-72) appends them to data/scan-history.tsv. Because deduplication occurs before these calls, the scan remains idempotent—running the script multiple times produces the same result without duplicating entries across your tracking files.
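To see why filtering before writing makes repeated runs idempotent, consider this simplified simulation. It uses in-memory arrays as stand-ins for data/pipeline.md and data/scan-history.tsv and checks URLs only; the runScan function and the store object are hypothetical and do not mirror the real appendToPipeline or appendToScanHistory signatures.

```javascript
// In-memory stand-ins for the two files the script appends to.
const store = { pipeline: [], history: [] };

// Hypothetical scan: rebuild the seen set from stored state, filter,
// then append only the survivors. Returns how many offers were written.
function runScan(offers, store) {
  const seenUrls = new Set([...store.history, ...store.pipeline]);
  const fresh = offers.filter((offer) => {
    if (seenUrls.has(offer.url)) return false;
    seenUrls.add(offer.url); // also guards against repeats within this batch
    return true;
  });
  for (const offer of fresh) {
    store.pipeline.push(offer.url);
    store.history.push(offer.url);
  }
  return fresh.length;
}

// Invented provider results, identical on both runs.
const offers = [
  { url: 'https://example.com/jobs/1' },
  { url: 'https://example.com/jobs/2' },
];

const firstRun = runScan(offers, store);
const secondRun = runScan(offers, store);
console.log(firstRun, secondRun, store.pipeline.length); // 2 0 2
```

The second run writes nothing because everything the first run persisted is reloaded into the seen set before filtering starts.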
Command-Line Usage Examples
Run the script with the following options to control the deduplication-aware pipeline:
```bash
# Standard scan: new unique offers appended to pipeline.md
node scan.mjs

# Preview mode: see what would be added without touching files
node scan.mjs --dry-run

# Verify URLs with Playwright before writing (duplicates filtered first)
node scan.mjs --verify
```
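One plausible way a preview flag such as --dry-run can gate writes while leaving filtering untouched is sketched below. How scan.mjs actually parses its arguments is not shown in this article, so parseFlags, persistOffers, and the injected write callback are assumptions made for illustration.

```javascript
// Hypothetical flag parsing; the real script may read its options differently.
function parseFlags(argv) {
  return {
    dryRun: argv.includes('--dry-run'),
    verify: argv.includes('--verify'),
  };
}

// Hypothetical persistence step. `write` stands in for the real file appends.
// Returns the number of offers actually written.
function persistOffers(offers, { dryRun, write }) {
  for (const offer of offers) {
    if (dryRun) {
      console.log('[dry-run] would add: ' + offer.url);
    } else {
      write(offer);
    }
  }
  return dryRun ? 0 : offers.length;
}

// Offers are assumed to be already deduplicated at this point.
const written = [];
const newOffers = [{ url: 'https://example.com/jobs/9' }];
const flags = parseFlags(['--dry-run']);
const count = persistOffers(newOffers, {
  dryRun: flags.dryRun,
  write: (offer) => written.push(offer.url),
});
console.log(count, written.length); // 0 0
```

Passing the writer in as a callback keeps the gating logic easy to test without touching any real files.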
Summary
- loadSeenUrls aggregates URLs from scan-history.tsv, pipeline.md, and applications.md into an in-memory Set at line 78.
- loadSeenCompanyRoles extracts normalized company::role pairs from applications.md at line 109.
- Two-layer filtering checks exact URL matches first, then normalized company-role combinations during the main loop.
- Intra-scan protection adds new offers to runtime sets immediately after validation to catch duplicates within the same batch.
- Idempotent writes ensure only non-duplicate offers reach appendToPipeline and appendToScanHistory.
Frequently Asked Questions
How does scan.mjs handle duplicates within the same scan batch?
The script adds every validated offer to the in-memory seenUrls and seenCompanyRoles sets immediately after filtering (lines 71-79). This ensures that if the same job appears twice from different providers in a single scan, the second occurrence is caught by the runtime sets and filtered out before reaching the persistence layer.
What happens if a job URL exists in pipeline.md but not in applications.md?
The loadSeenUrls function (line 78) harvests URLs from all three files, including data/pipeline.md. A URL that exists in the pending pipeline is therefore already in the seenUrls set when filtering begins, so the same URL is not re-added whether or not you have applied to that job yet.
Does the --dry-run flag skip the deduplication checks?
No. The --dry-run flag prevents file writes but does not bypass deduplication logic. The script still builds the full deduplication sets and filters offers accordingly, so you can preview exactly which new, non-duplicate offers would be appended to pipeline.md without modifying any tracking files.
Why check both URL and company-role combinations?
The URL check catches exact duplicates across all files, while the company-role pair check prevents the same position from re-entering under a different URL (for example, when a company cross-posts the same role to multiple job boards). This dual validation ensures comprehensive deduplication across the entire career tracking workflow.