# How scan.mjs Handles Deduplication Across pipeline.md, applications.md, and scan-history.tsv

> Learn how scan.mjs deduplicates job offers by using in-memory sets to track URLs and company-role pairs from pipeline.md, applications.md, and scan-history.tsv, ensuring unique listings before writing.

- Repository: [Santiago Fernández de Valderrama/career-ops](https://github.com/santifer/career-ops)
- Tags: how-to-guide
- Published: 2026-06-07

---

**`scan.mjs` prevents duplicate job offers by building two in-memory sets at startup and filtering new offers against them before writing to the pipeline. One set tracks every URL from `scan-history.tsv`, [`pipeline.md`](https://github.com/santifer/career-ops/blob/main/pipeline.md), and [`applications.md`](https://github.com/santifer/career-ops/blob/main/applications.md); the other tracks normalized company-role pairs from [`applications.md`](https://github.com/santifer/career-ops/blob/main/applications.md).**

The `santifer/career-ops` repository automates job search tracking by scanning multiple providers and aggregating results into markdown files. To keep scans idempotent and prevent the same position from appearing across your pipeline, application history, and scan archive, the `scan.mjs` script implements a two-layer deduplication system that checks both URLs and normalized company-role pairs.

## The Two-Layer Deduplication Architecture

The deduplication strategy relies on two in-memory `Set` objects built at startup. The URL set catches exact duplicates, while the company-role set stops the same role at the same company from re-entering the pipeline under a different link.

## Building the Deduplication Sets

Before processing any new offers, the script loads historical state into memory.

### Loading Seen URLs from Three Sources

The `loadSeenUrls` function in `scan.mjs` populates a `Set` containing every URL previously processed. It harvests URLs from three files:

- **`data/scan-history.tsv`**: The historical archive recording every job URL ever scanned
- **[`data/pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md)**: Currently pending offers awaiting action
- **[`data/applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md)**: Previously submitted applications
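
As a rough illustration of the idea, the sketch below gathers URLs from three strings that stand in for the file contents. The helper name `collectSeenUrls`, the regular expression, and the sample row formats are assumptions made for this example; the actual `loadSeenUrls` may parse each file differently.

```javascript
// Hypothetical helper: collects every http(s) URL found in the given texts.
// The real loadSeenUrls may parse each file format more precisely.
function collectSeenUrls(...fileContents) {
  const seen = new Set();
  // Stop at whitespace, table pipes, and the closing brackets of markdown links.
  const urlPattern = /https?:\/\/[^\s|)\]>]+/g;
  for (const text of fileContents) {
    for (const match of text.matchAll(urlPattern)) {
      seen.add(match[0]);
    }
  }
  return seen;
}

// Sample contents standing in for the three files (formats are illustrative).
const scanHistory = 'https://jobs.example.com/1\t2026-01-05\tAcme\tEngineer\n';
const pipeline = '- [ ] https://jobs.example.com/2 | Globex | Designer\n';
const applications = '| 1 | Acme | Engineer | [link](https://jobs.example.com/1) |\n';

const seenUrls = collectSeenUrls(scanHistory, pipeline, applications);
console.log(seenUrls.size); // 2 (the URL shared by two files is stored once)
```

Because a `Set` stores each value once, a URL that appears in more than one file costs nothing extra and lookups stay constant-time during the scan.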

### Extracting Company-Role Pairs from applications.md

The `loadSeenCompanyRoles` function creates a second `Set` of normalized `company::role` strings extracted exclusively from the markdown table in [`data/applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md). The normalization logic concatenates lowercase values: `company.toLowerCase() + '::' + title.toLowerCase()`.
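
The following sketch shows how such keys could be collected from a markdown table. The function names and the assumption that the table has header cells named `Company` and `Role` are hypothetical; only the key format comes from the description above.

```javascript
// Builds the normalized key: lowercase company, '::', lowercase title.
function companyRoleKey(company, title) {
  return company.toLowerCase() + '::' + title.toLowerCase();
}

// Hypothetical parser: assumes a markdown table whose header row names
// "Company" and "Role" columns. The real column layout may differ.
function collectCompanyRoles(markdown) {
  const seen = new Set();
  let header = null;
  for (const rawLine of markdown.split('\n')) {
    const line = rawLine.trim();
    if (!line.startsWith('|')) continue; // not a table row
    const cells = line.replace(/^\||\|$/g, '').split('|').map((c) => c.trim());
    if (header === null) {
      // The first table row is treated as the header.
      header = { company: cells.indexOf('Company'), role: cells.indexOf('Role') };
      continue;
    }
    if (cells.every((c) => /^:?-+:?$/.test(c))) continue; // |---|---| separator row
    const company = cells[header.company];
    const role = cells[header.role];
    if (company && role) seen.add(companyRoleKey(company, role));
  }
  return seen;
}

const applications = [
  '| # | Company | Role | Status |',
  '|---|---------|------|--------|',
  '| 1 | Acme | Senior Engineer | Applied |',
  '| 2 | ACME | senior engineer | Rejected |',
].join('\n');

const seenCompanyRoles = collectCompanyRoles(applications);
console.log([...seenCompanyRoles]); // [ 'acme::senior engineer' ]
```

Lowercasing both parts means that rows differing only in capitalization collapse to one key, as the two sample rows do here.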

## Runtime Filtering Logic

During the main scan loop, each job offer returned by a provider passes through three sequential filters: title validation, location validation, and deduplication.

The deduplication step performs two checks and then records each surviving offer:

1. **URL Existence Check**: The script tests `seenUrls.has(job.url)`. If the URL exists in any of the three source files, the offer is skipped immediately.
2. **Company-Role Pair Check**: For offers passing the first check, the script constructs the normalized key and tests `seenCompanyRoles.has(key)`. This prevents the same position from re-entering via a different job board URL.
3. **Intra-Scan Deduplication**: Offers surviving both checks are immediately added to the in-memory sets via `seenUrls.add(job.url)` and `seenCompanyRoles.add(key)`, so duplicates appearing later in the same scan batch are also filtered.
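
The three steps above can be sketched as a standalone function. The name `dedupeOffers` and the shape of the offer objects (`url`, `company`, `title`) are assumptions for illustration; in `scan.mjs` this logic runs inline in the main loop after the title and location filters.

```javascript
// Hypothetical standalone version of the deduplication step.
// Offer objects are assumed to have url, company, and title fields.
function dedupeOffers(offers, seenUrls, seenCompanyRoles) {
  const fresh = [];
  for (const job of offers) {
    // Check 1: exact URL already known.
    if (seenUrls.has(job.url)) continue;
    // Check 2: same company and role already known under another URL.
    const key = job.company.toLowerCase() + '::' + job.title.toLowerCase();
    if (seenCompanyRoles.has(key)) continue;
    // Step 3: record the survivor so later duplicates in this batch are caught.
    seenUrls.add(job.url);
    seenCompanyRoles.add(key);
    fresh.push(job);
  }
  return fresh;
}

const seenUrls = new Set(['https://jobs.example.com/1']);
const seenCompanyRoles = new Set(['acme::engineer']);
const offers = [
  { url: 'https://jobs.example.com/1', company: 'Initech', title: 'Analyst' }, // URL seen
  { url: 'https://board.example.org/9', company: 'Acme', title: 'Engineer' }, // pair seen
  { url: 'https://jobs.example.com/3', company: 'Globex', title: 'Designer' }, // new
  { url: 'https://board.example.org/4', company: 'globex', title: 'designer' }, // repeat in batch
];

const fresh = dedupeOffers(offers, seenUrls, seenCompanyRoles);
console.log(fresh.map((job) => job.url)); // [ 'https://jobs.example.com/3' ]
```

Checking the URL first is the cheaper test, since it avoids building the key for offers that are exact repeats.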

## Persisting Filtered Results

After filtering completes, only unique offers reach the persistence layer. The `appendToPipeline` function writes them to [`data/pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md), while `appendToScanHistory` appends them to `data/scan-history.tsv`. Because deduplication occurs before these calls, the scan is idempotent: re-running the script against the same provider results adds no new entries to your tracking files.
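
To make the idempotence property concrete, here is a minimal sketch that uses in-memory arrays in place of the two files. The `runScan` driver and the `store` object are hypothetical, and the sketch filters by URL only to stay short; it is not how `scan.mjs` is structured.

```javascript
// In-memory stand-ins for the two output files; the real functions append to disk.
const store = { pipeline: [], scanHistory: [] };

// Hypothetical driver: rebuilds the seen-URL set from persisted state,
// filters the offers, then persists survivors unless dryRun is set.
function runScan(offers, { dryRun = false } = {}) {
  const persisted = [...store.pipeline, ...store.scanHistory];
  const seenUrls = new Set(persisted.map((offer) => offer.url));
  const fresh = [];
  for (const job of offers) {
    if (seenUrls.has(job.url)) continue;
    seenUrls.add(job.url);
    fresh.push(job);
  }
  if (!dryRun) {
    store.pipeline.push(...fresh);
    store.scanHistory.push(...fresh);
  }
  return fresh;
}

const offers = [
  { url: 'https://jobs.example.com/3', company: 'Globex', title: 'Designer' },
  { url: 'https://jobs.example.com/5', company: 'Initech', title: 'Analyst' },
];

const preview = runScan(offers, { dryRun: true }); // filters, writes nothing
const first = runScan(offers); // persists both offers
const second = runScan(offers); // everything is already known
console.log(preview.length, first.length, second.length); // 2 2 0
console.log(store.pipeline.length); // 2
```

The second run returns nothing because the first run's output has become part of the state that the next run loads.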

## Command-Line Usage Examples

Run the script with the following options to control the deduplication-aware pipeline:

```bash
# Standard scan: new unique offers appended to pipeline.md
node scan.mjs

# Preview mode: see what would be added without touching files
node scan.mjs --dry-run

# Verify URLs with Playwright before writing (duplicates filtered first)
node scan.mjs --verify
```

## Summary

- **`loadSeenUrls`** aggregates URLs from `scan-history.tsv`, [`pipeline.md`](https://github.com/santifer/career-ops/blob/main/pipeline.md), and [`applications.md`](https://github.com/santifer/career-ops/blob/main/applications.md) into an in-memory `Set`.
- **`loadSeenCompanyRoles`** extracts normalized `company::role` pairs from [`applications.md`](https://github.com/santifer/career-ops/blob/main/applications.md).
- **Two-layer filtering** checks exact URL matches first, then normalized company-role combinations during the main loop.
- **Intra-scan protection** adds new offers to runtime sets immediately after validation to catch duplicates within the same batch.
- **Idempotent writes** ensure only non-duplicate offers reach `appendToPipeline` and `appendToScanHistory`.

## Frequently Asked Questions

### How does scan.mjs handle duplicates within the same scan batch?

The script adds every validated offer to the in-memory `seenUrls` and `seenCompanyRoles` sets immediately after it passes both checks. If the same job appears twice from different providers in a single scan, the second occurrence is caught by the runtime sets and filtered out before reaching the persistence layer.

### What happens if a job URL exists in pipeline.md but not in applications.md?

The `loadSeenUrls` function harvests URLs from all three files, including [`data/pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md). A URL that exists in the pending pipeline is therefore already in the deduplication set, which prevents it from being re-added regardless of whether you have applied to it yet.

### Does the --dry-run flag skip the deduplication checks?

No. The `--dry-run` flag prevents file writes but does not bypass deduplication logic. The script still builds the full deduplication sets and filters offers accordingly, so you can preview exactly which new, non-duplicate offers would be appended to [`pipeline.md`](https://github.com/santifer/career-ops/blob/main/pipeline.md) without modifying any files.

### Why check both URL and company-role combinations?

The URL check catches exact duplicates across all three files, while the company-role pair check prevents a position already recorded in `applications.md`, or already accepted earlier in the same scan, from re-entering under a different URL (for example, when a company cross-posts the same role to multiple job boards). Together, the two checks cover both re-scanned links and cross-posted roles.