How scan.mjs Handles Deduplication Across pipeline.md, applications.md, and scan-history.tsv

scan.mjs prevents duplicate job offers by building two in-memory sets at startup. One tracks every URL from scan-history.tsv, pipeline.md, and applications.md; the other tracks normalized company-role pairs from applications.md. New offers are filtered against both sets before anything is written to the pipeline.

The santifer/career-ops repository automates job search tracking by scanning multiple providers and aggregating results into markdown files. To keep scans idempotent and stop the same position from appearing more than once across your pipeline, application history, and scan archive, scan.mjs implements a two-layer deduplication system that checks both URLs and normalized company-role pairs.

The Two-Layer Deduplication Architecture

The deduplication strategy relies on two distinct in-memory Set objects constructed at runtime. This dual approach catches exact duplicates by URL while also preventing the same role at the same company from re-entering the pipeline under a different link, as implemented in the santifer/career-ops source code.

Building the Deduplication Sets

Before processing any new offers, the script loads historical state into memory.

Loading Seen URLs from Three Sources

The loadSeenUrls function (defined at line 78 in scan.mjs) populates a Set containing every URL previously processed. According to the source code, it harvests URLs from three distinct files:

  • data/scan-history.tsv: The historical archive recording every job URL ever scanned
  • data/pipeline.md: Currently pending offers awaiting action
  • data/applications.md: Previously submitted applications
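A minimal sketch of how such a set might be assembled. The helper names extractUrls and buildSeenUrls, the URL regex, and the sample file contents are illustrative assumptions, not the repository's code; the real loadSeenUrls reads the three files from disk, whereas this sketch takes their text as arguments so it runs on its own.

```javascript
// Hypothetical helper: pull every http(s) URL out of a block of text.
// The pattern is an assumption: a URL ends at whitespace, '|', ')', '>' or ']'.
function extractUrls(text) {
  return text.match(/https?:\/\/[^\s|)>\]]+/g) ?? [];
}

// Hypothetical stand-in for loadSeenUrls: merges URLs from any number of
// file contents into one Set, so each URL is stored once.
function buildSeenUrls(...fileContents) {
  const seen = new Set();
  for (const content of fileContents) {
    for (const url of extractUrls(content)) {
      seen.add(url);
    }
  }
  return seen;
}

// Invented sample contents standing in for the three data files.
const scanHistory = 'https://example.com/jobs/1\t2024-01-01\tAcme\n';
const pipeline = '- [ ] https://example.com/jobs/2 | Acme | Engineer\n';
const applications = '| 1 | Acme | Engineer | https://example.com/jobs/1 |\n';

const seenUrls = buildSeenUrls(scanHistory, pipeline, applications);
console.log(seenUrls.size); // 2: jobs/1 appears in two files but is stored once
```

Using a Set means a URL recorded in more than one file costs nothing extra, and each later membership check is a single has() call.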

Extracting Company-Role Pairs from applications.md

The loadSeenCompanyRoles function (line 109) creates a second Set tracking normalized company::role strings extracted exclusively from the markdown table in data/applications.md. The normalization logic concatenates lowercase values: company.toLowerCase() + '::' + title.toLowerCase().
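The normalization above can be sketched as follows. Only the key format (lowercased company, '::', lowercased title) comes from the description; the function names, the table parser, and the assumption that company and role sit in the third and fourth columns are hypothetical and may not match the real applications.md layout.

```javascript
// Key format as described: lowercase both parts and join with '::'.
function companyRoleKey(company, title) {
  return company.toLowerCase() + '::' + title.toLowerCase();
}

// Hypothetical stand-in for loadSeenCompanyRoles. Column positions are
// assumptions (0-based): company in column 2, role in column 3.
function buildSeenCompanyRoles(markdown, companyCol = 2, roleCol = 3) {
  const seen = new Set();
  let inBody = false; // becomes true once the |---|---| separator row is passed
  for (const line of markdown.split('\n')) {
    const row = line.trim();
    if (!row.startsWith('|')) continue; // not a table row
    const cells = row.split('|').slice(1, -1).map((c) => c.trim());
    if (cells.length === 0) continue;
    if (cells.every((c) => /^:?-+:?$/.test(c))) {
      inBody = true; // separator row: everything after it is data
      continue;
    }
    if (!inBody) continue; // header row
    const company = cells[companyCol];
    const role = cells[roleCol];
    if (company && role) seen.add(companyRoleKey(company, role));
  }
  return seen;
}

// Invented sample table.
const sample = [
  '| # | Date | Company | Role |',
  '|---|------|---------|------|',
  '| 1 | 2024-01-05 | Acme Corp | Senior Engineer |',
].join('\n');

const seenCompanyRoles = buildSeenCompanyRoles(sample);
console.log([...seenCompanyRoles]); // [ 'acme corp::senior engineer' ]
```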

Runtime Filtering Logic

During the main scan loop, each job offer returned by a provider passes through three sequential filters: title validation, location validation, and deduplication.

The deduplication step performs two checks and then records each surviving offer:

  1. URL Existence Check (lines 71-73): The script verifies if seenUrls.has(job.url) returns true. If the URL exists in any of the three source files, the offer is skipped immediately.
  2. Company-Role Pair Check (lines 75-79): For offers passing the first filter, the script constructs the normalized key and checks against seenCompanyRoles.has(key). This prevents the same position from re-entering via a different job board URL.
  3. Intra-Scan Deduplication: Offers surviving both checks are immediately added to the in-memory sets via seenUrls.add(job.url) and seenCompanyRoles.add(key), ensuring duplicates appearing later in the same scan batch are also filtered.
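The three steps above can be sketched as a single loop. This is an illustration under stated assumptions rather than the script's actual code: the function name dedupeOffers and the sample offers are invented, and each job is assumed to be an object with url, company, and title fields.

```javascript
function companyRoleKey(company, title) {
  return company.toLowerCase() + '::' + title.toLowerCase();
}

// Hypothetical filter: returns only unseen offers and mutates both sets,
// so a duplicate appearing later in the same batch is also skipped.
function dedupeOffers(offers, seenUrls, seenCompanyRoles) {
  const fresh = [];
  for (const job of offers) {
    if (seenUrls.has(job.url)) continue; // layer 1: exact URL match
    const key = companyRoleKey(job.company, job.title);
    if (seenCompanyRoles.has(key)) continue; // layer 2: same role, other URL
    seenUrls.add(job.url); // record immediately (intra-scan protection)
    seenCompanyRoles.add(key);
    fresh.push(job);
  }
  return fresh;
}

// Invented state, as if loaded from the tracking files.
const seenUrls = new Set(['https://boards.example.com/acme/1']);
const seenCompanyRoles = new Set(['acme::backend engineer']);

const offers = [
  // Known URL: skipped by layer 1.
  { url: 'https://boards.example.com/acme/1', company: 'Acme', title: 'Backend Engineer' },
  // New URL, known company-role pair: skipped by layer 2.
  { url: 'https://other.example.com/acme-backend', company: 'ACME', title: 'Backend Engineer' },
  // Genuinely new: kept.
  { url: 'https://boards.example.com/globex/7', company: 'Globex', title: 'Data Engineer' },
  // Cross-post of the previous offer in the same batch: skipped.
  { url: 'https://other.example.com/globex-data', company: 'Globex', title: 'data engineer' },
];

const fresh = dedupeOffers(offers, seenUrls, seenCompanyRoles);
console.log(fresh.map((job) => job.url)); // [ 'https://boards.example.com/globex/7' ]
```

Checking the URL first keeps the common case cheap: the company-role key is only built for offers whose URL has not been seen.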

Persisting Filtered Results

After filtering completes, only verified unique offers reach the persistence layer. The appendToPipeline function (lines 108-110) writes these to data/pipeline.md, while appendToScanHistory (lines 58-72) appends them to data/scan-history.tsv. Because deduplication occurs before these calls, the scan remains idempotent: running the script again over the same provider results adds no further entries to your tracking files.
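To illustrate why filtering before writing makes repeated runs safe, here is a simplified sketch. It uses in-memory arrays as stand-ins for data/pipeline.md and data/scan-history.tsv and applies only the URL layer; the runScan function and the store object are hypothetical and do not reflect how the real script reads or writes its files.

```javascript
// Hypothetical scan over in-memory "files". Each run rebuilds the seen set
// from what earlier runs persisted, filters, and only then writes.
function runScan(store, offers) {
  const seenUrls = new Set([...store.history, ...store.pipeline]);
  const fresh = [];
  for (const job of offers) {
    if (seenUrls.has(job.url)) continue;
    seenUrls.add(job.url);
    fresh.push(job);
  }
  // Persistence happens after filtering, so duplicates never reach it.
  for (const job of fresh) {
    store.pipeline.push(job.url);
    store.history.push(job.url);
  }
  return fresh.length;
}

const store = { history: [], pipeline: [] };
const offers = [
  { url: 'https://example.com/jobs/a' },
  { url: 'https://example.com/jobs/b' },
];

const firstRun = runScan(store, offers);
const secondRun = runScan(store, offers);
console.log(firstRun, secondRun, store.pipeline.length); // 2 0 2
```

The second run finds both URLs already recorded by the first, so it appends nothing and the stored state is unchanged.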

Command-Line Usage Examples

Run the script with the following options to control the deduplication-aware pipeline:


# Standard scan: new unique offers appended to pipeline.md
node scan.mjs

# Preview mode: see what would be added without touching files
node scan.mjs --dry-run

# Verify URLs with Playwright before writing (duplicates filtered first)
node scan.mjs --verify

Summary

  • loadSeenUrls aggregates URLs from scan-history.tsv, pipeline.md, and applications.md into an in-memory Set at line 78.
  • loadSeenCompanyRoles extracts normalized company::role pairs from applications.md at line 109.
  • Two-layer filtering checks exact URL matches first, then normalized company-role combinations during the main loop.
  • Intra-scan protection adds new offers to runtime sets immediately after validation to catch duplicates within the same batch.
  • Idempotent writes ensure only non-duplicate offers reach appendToPipeline and appendToScanHistory.

Frequently Asked Questions

How does scan.mjs handle duplicates within the same scan batch?

The script adds every validated offer to the in-memory seenUrls and seenCompanyRoles sets immediately after filtering (lines 71-79). This ensures that if the same job appears twice from different providers in a single scan, the second occurrence is caught by the runtime sets and filtered out before reaching the persistence layer.

What happens if a job URL exists in pipeline.md but not in applications.md?

The loadSeenUrls function harvests URLs from all three files, including data/pipeline.md. A URL that exists in the pending pipeline is therefore already in the set built by loadSeenUrls (line 78), so it is not re-added regardless of whether you have applied to it yet.

Does the --dry-run flag skip the deduplication checks?

No. The --dry-run flag prevents file writes but does not bypass deduplication logic. The script still builds the full deduplication sets and filters offers accordingly, allowing you to preview exactly which new, non-duplicate offers would be appended to pipeline.md without modifying any files.
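One way to picture this behavior is to gate only the write step on the flag. The sketch below is an assumption about structure, not the script's actual code: parseFlags and persist are invented names, and the two writer functions are stubs that borrow the names appendToPipeline and appendToScanHistory with guessed signatures.

```javascript
// Hypothetical flag parsing over an argv-style array.
function parseFlags(argv) {
  return {
    dryRun: argv.includes('--dry-run'),
    verify: argv.includes('--verify'),
  };
}

// Hypothetical persistence step. Filtering has already happened by the time
// this is called; --dry-run only decides whether the writers run.
function persist(fresh, flags, writers) {
  if (flags.dryRun) {
    return { written: 0, previewed: fresh.length };
  }
  writers.appendToPipeline(fresh);
  writers.appendToScanHistory(fresh);
  return { written: fresh.length, previewed: 0 };
}

// Stub writers that record calls instead of touching files.
const calls = [];
const writers = {
  appendToPipeline: (jobs) => calls.push('pipeline:' + jobs.length),
  appendToScanHistory: (jobs) => calls.push('history:' + jobs.length),
};

const fresh = [{ url: 'https://example.com/jobs/a' }];

const dry = persist(fresh, parseFlags(['node', 'scan.mjs', '--dry-run']), writers);
console.log(dry, calls.length); // { written: 0, previewed: 1 } 0

const real = persist(fresh, parseFlags(['node', 'scan.mjs']), writers);
console.log(real, calls); // { written: 1, previewed: 0 } [ 'pipeline:1', 'history:1' ]
```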

Why check both URL and company-role combinations?

The URL check catches exact duplicates across all files, while the company-role pair check prevents the same position from re-entering under a different URL (for example, when a company cross-posts the same role to multiple job boards). Together, the two checks cover both re-listed URLs and re-posted roles.
