# How scan.mjs Deduplicates Job Offers Against scan-history.tsv in Career-Ops

> Learn how scan.mjs efficiently deduplicates job offers using scan-history.tsv. Discover the Set-based approach to prevent duplicate postings in career-ops.

- Repository: [Santiago Fernández de Valderrama/career-ops](https://github.com/santifer/career-ops)
- Tags: how-to-guide
- Published: 2026-06-09

---

**`scan.mjs` prevents duplicate job postings by loading every previously seen URL from `data/scan-history.tsv` (plus the pipeline and application trackers) into a `Set`, keeping a second `Set` of company-role keys, and skipping any incoming job whose URL or company-role pair is already in the corresponding set.**

The `career-ops` repository by santifer automates job-search tracking by scanning multiple job portals and aggregating new offers into a curated pipeline. Central to this automation is the **deduplication mechanism** inside `scan.mjs`, which cross-references the persistent scan history and the active trackers so that a URL already on record is not added again. Understanding how `scan.mjs` deduplicates against `scan-history.tsv` is key to maintaining a clean, duplicate-free job pipeline.

## Building the Deduplication Set

Before the scan begins, `scan.mjs` builds an in-memory index of every job URL it has already encountered. This is handled by the `loadSeenUrls()` function in `scan.mjs`.

### Loading Historic URLs from scan-history.tsv

The primary source of truth is `data/scan-history.tsv`. The `loadSeenUrls()` function checks if this file exists, reads it as UTF-8, splits it by newlines, and extracts the URL from the first column of each row (skipping the header). Every extracted URL is added to a `Set` called `seen`:

```js
function loadSeenUrls() {
  const seen = new Set();

  // scan-history.tsv – first column = URL
  if (existsSync(SCAN_HISTORY_PATH)) {
    const lines = readFileSync(SCAN_HISTORY_PATH, 'utf-8').split('\n');
    for (const line of lines.slice(1)) {          // skip header
      const url = line.split('\t')[0];
      if (url) seen.add(url);
    }
  }

  // pipeline.md – URLs in checkbox lines
  if (existsSync(PIPELINE_PATH)) {
    const text = readFileSync(PIPELINE_PATH, 'utf-8');
    for (const match of text.matchAll(/- \[[ x]\] (https?:\/\/\S+)/g)) {
      seen.add(match[1]);
    }
  }

  // applications.md – any inline URL
  if (existsSync(APPLICATIONS_PATH)) {
    const text = readFileSync(APPLICATIONS_PATH, 'utf-8');
    for (const match of text.matchAll(/https?:\/\/[^\s|)]+/g)) {
      seen.add(match[0]);
    }
  }

  return seen;
}
```

Because a `Set` provides O(1) lookups, this keeps duplicate checks fast even as the history file grows.
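The history-loading step can be tried in isolation with a minimal sketch like the one below. It applies the same split-and-take-first-column logic to a string instead of a file; the helper name `urlsFromTsv` and the sample rows are assumptions made for illustration and are not part of `scan.mjs`.

```js
// Hypothetical helper: same parsing as the scan-history branch of
// loadSeenUrls(), but applied to a string so no data files are needed.
function urlsFromTsv(text) {
  const seen = new Set();
  const lines = text.split('\n');
  for (const line of lines.slice(1)) {   // skip the header row
    const url = line.split('\t')[0];     // first column = URL
    if (url) seen.add(url);              // blank lines yield '' and are ignored
  }
  return seen;
}

// Made-up sample rows; the URLs are placeholders.
const sampleHistory = [
  'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation',
  'https://example.com/jobs/1\t2026-01-01\tdemo\tEngineer\tAcme\tadded\tRemote',
  'https://example.com/jobs/2\t2026-01-02\tdemo\tDesigner\tAcme\tadded\t',
  '',
].join('\n');

const seenSample = urlsFromTsv(sampleHistory);
console.log(seenSample.size);                               // 2
console.log(seenSample.has('https://example.com/jobs/1'));  // true
console.log(seenSample.has('https://example.com/jobs/3'));  // false
```

The trailing newline that `appendToScanHistory()` writes produces an empty last line after the split, which is why the `if (url)` guard matters.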

### Indexing Pipeline and Application Trackers

To reflect the entire system state, not just past scans, `loadSeenUrls()` also ingests URLs from [`data/pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md) and [`data/applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md). It scans [`pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md) for markdown checkbox lines containing URLs (`- [ ] http...` or `- [x] http...`) and scans [`applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md) for any inline HTTP or HTTPS link. Both sources feed into the same `seen` set, so a job added manually to the pipeline or application tracker is not reintroduced by an automated scan under the same URL.
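To see what the two patterns actually capture, the following sketch runs the regular expressions from `loadSeenUrls()` against small sample strings. The sample markdown is invented for illustration; the real layout of `pipeline.md` and `applications.md` may differ.

```js
// The two patterns used by loadSeenUrls(), copied from the listing above.
const pipelineRe = /- \[[ x]\] (https?:\/\/\S+)/g;
const applicationsRe = /https?:\/\/[^\s|)]+/g;

// Invented sample content; the real files may be laid out differently.
const pipelineSample = [
  '## Pending',
  '- [ ] https://example.com/jobs/10 | Acme | Engineer',
  '- [x] https://example.com/jobs/11',
  '- https://example.com/jobs/12 (no checkbox, so not matched)',
].join('\n');

const applicationsSample =
  '| 1 | Acme | Engineer | [link](https://example.com/jobs/20) |';

// Checkbox lines: the URL is capture group 1.
const pipelineUrls = [...pipelineSample.matchAll(pipelineRe)].map(m => m[1]);
// Applications: the whole match is the URL; it stops at whitespace, '|' or ')'.
const applicationUrls = [...applicationsSample.matchAll(applicationsRe)].map(m => m[0]);

console.log(pipelineUrls);     // [ 'https://example.com/jobs/10', 'https://example.com/jobs/11' ]
console.log(applicationUrls);  // [ 'https://example.com/jobs/20' ]
```

Note that the pipeline pattern only matches lowercase `x` in the checkbox and takes everything up to the next whitespace as the URL, so a URL followed directly by punctuation would be stored with that punctuation attached.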

## The Deduplication Flow During Scanning

Once the sets are built, the main provider loop iterates over every returned job. After the title and location filters, each candidate passes through two deduplication checks before it is considered new.

### URL-Level Deduplication

For every `job` object returned by a provider, `scan.mjs` checks `if (seenUrls.has(job.url))`. If the URL is already in the set, the job is counted as a duplicate with `totalDupes++` and immediately skipped via `continue`:

```js
for (const job of jobs) {
  // …title & location filters omitted…

  // 1️⃣ URL deduplication
  if (seenUrls.has(job.url)) {
    totalDupes++;
    continue;               // skip this job
  }

  // 2️⃣ Company-role deduplication
  const key = `${job.company.toLowerCase()}::${job.title.toLowerCase()}`;
  if (seenCompanyRoles.has(key)) {
    totalDupes++;
    continue;               // skip this job
  }

  // Mark as seen immediately to avoid intra-scan repeats
  seenUrls.add(job.url);
  seenCompanyRoles.add(key);
  newOffers.push({ ...job, source: sourceName });
}
```

The URL check runs first, so an exact match is rejected before the company-role key is even built.

### Company-Role Deduplication

A secondary guard prevents the same role at the same company from being added under a different URL. A separate `loadSeenCompanyRoles()` function parses [`data/applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md) to build a set of `company::role` keys. Inside the loop, `scan.mjs` constructs a lowercase key from `job.company` and `job.title` and checks it against `seenCompanyRoles`. This catches duplicate postings that might have unique tracking URLs or referral parameters.
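The key comparison can be illustrated with a short sketch. The helper name `companyRoleKey` and the sample jobs are hypothetical; only the key format (lowercase company, `::`, lowercase title) comes from the loop shown earlier.

```js
// Hypothetical helper that builds the key the same way the scan loop does.
function companyRoleKey(job) {
  return `${job.company.toLowerCase()}::${job.title.toLowerCase()}`;
}

// Pretend this key was loaded from the application tracker.
const seenRolesSample = new Set(['acme::backend engineer']);

// Same role, different casing, and a URL with a referral parameter.
const repost = {
  url: 'https://example.com/jobs/30?ref=newsletter',
  company: 'ACME',
  title: 'Backend Engineer',
};

// A genuinely different title at the same company.
const otherRole = {
  url: 'https://example.com/jobs/31',
  company: 'Acme',
  title: 'Senior Backend Engineer',
};

console.log(seenRolesSample.has(companyRoleKey(repost)));     // true  -> skipped
console.log(seenRolesSample.has(companyRoleKey(otherRole)));  // false -> kept
```

Because the key is only lowercased, differences in whitespace or punctuation (for example a trailing space in the company name) produce a different key and are not caught by this check.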

### In-Scan Bookkeeping

Notice that the loop immediately updates both sets as soon as a job clears the filters:

```js
  // Mark as seen immediately to avoid intra-scan repeats
  seenUrls.add(job.url);
  seenCompanyRoles.add(key);
  newOffers.push({ ...job, source: sourceName });
```

This in-scan bookkeeping prevents two different providers from introducing the same posting within a single execution.
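The effect is easiest to see with two providers that overlap. The sketch below wraps the loop shown earlier in a hypothetical function, `collectNewOffers`, and feeds it invented results keyed by source name; the wrapper and the data shape are assumptions, while the loop body mirrors the excerpt.

```js
// Hypothetical wrapper around the dedup loop; resultsBySource maps a
// source name to the array of jobs that provider returned.
function collectNewOffers(resultsBySource, seenUrls, seenCompanyRoles) {
  const newOffers = [];
  let totalDupes = 0;

  for (const [sourceName, jobs] of Object.entries(resultsBySource)) {
    for (const job of jobs) {
      if (seenUrls.has(job.url)) { totalDupes++; continue; }

      const key = `${job.company.toLowerCase()}::${job.title.toLowerCase()}`;
      if (seenCompanyRoles.has(key)) { totalDupes++; continue; }

      // Mark as seen immediately so later providers cannot re-add it.
      seenUrls.add(job.url);
      seenCompanyRoles.add(key);
      newOffers.push({ ...job, source: sourceName });
    }
  }
  return { newOffers, totalDupes };
}

// Both sets start empty, so every duplicate below is an intra-scan one.
const scanResult = collectNewOffers({
  portalA: [
    { url: 'https://example.com/jobs/40', company: 'Acme', title: 'Engineer' },
  ],
  portalB: [
    { url: 'https://example.com/jobs/40', company: 'Acme', title: 'Engineer' },        // same URL
    { url: 'https://example.com/jobs/41?src=b', company: 'acme', title: 'engineer' },  // same role, new URL
    { url: 'https://example.com/jobs/42', company: 'Globex', title: 'Engineer' },      // new
  ],
}, new Set(), new Set());

console.log(scanResult.newOffers.map(o => `${o.source}: ${o.url}`));
// [ 'portalA: https://example.com/jobs/40', 'portalB: https://example.com/jobs/42' ]
console.log(scanResult.totalDupes);  // 2
```

Without the two `add` calls, both `portalB` duplicates would pass, since neither set would know about the `portalA` offer yet.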

## Persisting New URLs to scan-history.tsv

After a new offer passes verification and filtering, its metadata is permanently recorded so future runs will recognize it. The `appendToScanHistory()` function in `scan.mjs` appends each accepted job to `data/scan-history.tsv`, creating the file with a TSV header if it does not already exist:

```js
function appendToScanHistory(offers, date, status = 'added') {
  if (!existsSync(SCAN_HISTORY_PATH)) {
    writeFileSync(
      SCAN_HISTORY_PATH,
      'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation\n',
      'utf-8'
    );
  }

  const lines = offers.map(o =>
    `${o.url}\t${date}\t${o.source}\t${o.title}\t${o.company}\t${status}\t${o.location || ''}`
  ).join('\n') + '\n';

  appendFileSync(SCAN_HISTORY_PATH, lines, 'utf-8');
}
```

By writing the URL into the first column of `scan-history.tsv`, the system ensures that `loadSeenUrls()` will include it in the deduplication set on the next execution.
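The round trip depends on every row having the URL in column 0, which a stray tab or newline inside a title would break. The listing above does not show any escaping, so the sketch below adds a field-cleaning step as a defensive assumption; `cleanField` and `toHistoryRow` are hypothetical names, not functions from `scan.mjs`.

```js
// Hypothetical: collapse tabs and line breaks so a field cannot split a row.
const cleanField = (value) =>
  String(value ?? '').replace(/[\t\r\n]+/g, ' ').trim();

// Hypothetical row builder using the same column order as the TSV header:
// url, first_seen, portal, title, company, status, location.
function toHistoryRow(offer, date, status = 'added') {
  return [offer.url, date, offer.source, offer.title, offer.company, status, offer.location]
    .map(cleanField)
    .join('\t');
}

// Invented offer whose title contains a tab and which has no location.
const historyRow = toHistoryRow(
  { url: 'https://example.com/jobs/50', source: 'demo', title: 'Staff\tEngineer', company: 'Acme' },
  '2026-06-09'
);

const historyColumns = historyRow.split('\t');
console.log(historyColumns.length);  // 7
console.log(historyColumns[0]);      // https://example.com/jobs/50
console.log(historyColumns[3]);      // Staff Engineer
```

Reading `historyColumns[0]` is exactly what `loadSeenUrls()` does on the next run, so keeping the column count stable keeps the URL lookup reliable.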

## Summary

- **`loadSeenUrls()`** seeds a `Set` with URLs from `data/scan-history.tsv`, [`data/pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md), and [`data/applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md) to establish a unified view of all previously seen jobs.
- The main scan loop uses **`seenUrls.has(job.url)`** to block exact URL duplicates before they reach the pipeline.
- A secondary **`seenCompanyRoles.has(key)`** check prevents the same company-role pair from being added under a different link.
- **In-scan updates** to both sets stop different providers from introducing the same posting within a single execution.
- Verified offers are persisted back to **`data/scan-history.tsv`** via **`appendToScanHistory()`**, closing the loop for future deduplication.

## Frequently Asked Questions

### How does scan.mjs know which column contains the URL in scan-history.tsv?

`loadSeenUrls()` treats the first column as the URL by splitting each line on tabs and reading index `0`. This matches the TSV header written by `appendToScanHistory()`, which defines `url` as the initial field.

### What happens if scan-history.tsv does not exist yet?

If `data/scan-history.tsv` is missing, `loadSeenUrls()` skips the history load and returns a set containing only URLs found in [`pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md) and [`applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md). When the first verified offers are processed, `appendToScanHistory()` creates the file and writes the tab-separated header row automatically.

### Why does scan.mjs check pipeline.md and applications.md in addition to scan-history.tsv?

Manual edits to [`pipeline.md`](https://github.com/santifer/career-ops/blob/main/data/pipeline.md) or [`applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md) might add jobs that have never been through an automated scan. By ingesting URLs from all three files, `scan.mjs` deduplicates against the entire system state rather than just its own execution history.

### Can two different job URLs for the same role at the same company both enter the pipeline?

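Not within a single run, and not when the role is already tracked in [`data/applications.md`](https://github.com/santifer/career-ops/blob/main/data/applications.md). The `company::role` deduplication key, built from lowercase `job.company` and `job.title`, blocks the second posting even if its URL is unique. Based on the description above, `seenCompanyRoles` is populated from `applications.md` and extended during the scan, so a role recorded only in `scan-history.tsv` or `pipeline.md` is matched by URL alone on later runs.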