How scan.mjs Deduplicates Job Offers Against scan-history.tsv in Career-Ops
scan.mjs prevents duplicate job postings by loading every previously seen URL from data/scan-history.tsv, together with the pipeline and application trackers, into a Set, building a second Set of company-role keys, and skipping any incoming job whose URL or company-role pair is already present.
The career-ops repository by santifer automates job-search tracking by scanning multiple job portals and aggregating new offers into a curated pipeline. Central to this automation is a robust deduplication mechanism inside scan.mjs that cross-references persistent history and active trackers to ensure no URL is ever processed twice. Understanding how scan.mjs deduplicates against scan-history.tsv is key to maintaining a clean, duplicate-free job pipeline.
Building the Deduplication Set
Before the scan begins, scan.mjs builds an in-memory index of every job URL it has already encountered. This is handled by the loadSeenUrls() function in scan.mjs.
Loading Historic URLs from scan-history.tsv
The primary source of truth is data/scan-history.tsv. The loadSeenUrls() function checks if this file exists, reads it as UTF-8, splits it by newlines, and extracts the URL from the first column of each row (skipping the header). Every extracted URL is added to a Set called seen:
function loadSeenUrls() {
  const seen = new Set();
  // scan-history.tsv – first column = URL
  if (existsSync(SCAN_HISTORY_PATH)) {
    const lines = readFileSync(SCAN_HISTORY_PATH, 'utf-8').split('\n');
    for (const line of lines.slice(1)) { // skip header
      const url = line.split('\t')[0];
      if (url) seen.add(url);
    }
  }
  // pipeline.md – URLs in checkbox lines
  if (existsSync(PIPELINE_PATH)) {
    const text = readFileSync(PIPELINE_PATH, 'utf-8');
    for (const match of text.matchAll(/- \[[ x]\] (https?:\/\/\S+)/g)) {
      seen.add(match[1]);
    }
  }
  // applications.md – any inline URL
  if (existsSync(APPLICATIONS_PATH)) {
    const text = readFileSync(APPLICATIONS_PATH, 'utf-8');
    for (const match of text.matchAll(/https?:\/\/[^\s|)]+/g)) {
      seen.add(match[0]);
    }
  }
  return seen;
}
Because a Set offers effectively constant-time lookups, duplicate checks stay fast even as the history file grows.
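To see the first-column extraction in isolation, the following minimal sketch applies the same split logic to an in-memory string instead of a file. The helper name urlsFromTsv and the sample rows are hypothetical and are not part of scan.mjs.

```javascript
// Hypothetical helper: same first-column logic as loadSeenUrls(), applied to a string.
function urlsFromTsv(text) {
  const seen = new Set();
  for (const line of text.split('\n').slice(1)) { // skip the header row
    const url = line.split('\t')[0];
    if (url) seen.add(url); // a trailing empty line yields '' and is ignored
  }
  return seen;
}

// Made-up sample data, for illustration only.
const sample = [
  'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation',
  'https://example.com/jobs/1\t2024-01-01\tdemo\tEngineer\tAcme\tadded\tRemote',
  'https://example.com/jobs/2\t2024-01-02\tdemo\tDesigner\tAcme\tadded\t',
  '',
].join('\n');

const seen = urlsFromTsv(sample);
console.log(seen.size);                              // 2
console.log(seen.has('https://example.com/jobs/1')); // true
console.log(seen.has('https://example.com/jobs/3')); // false
```

The membership test is an exact string comparison, so two URLs that differ only in a query parameter count as different entries.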
Indexing Pipeline and Application Trackers
To reflect the entire system state—not just past scans—loadSeenUrls() also ingests URLs from data/pipeline.md and data/applications.md. It scans pipeline.md for markdown checkbox lines containing URLs (- [ ] http... or - [x] http...) and scans applications.md for any inline HTTP link. Both sources feed into the same seen set, ensuring that a job added manually to the pipeline or application tracker will never be reintroduced by an automated scan.
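The two patterns can be tried on sample text to see what each one captures. This is a small sketch: the regular expressions are copied from loadSeenUrls(), while the constant names and the sample pipeline and applications lines are invented for illustration, since the real file layouts are not shown here.

```javascript
// Regexes as they appear in loadSeenUrls(); the constant names are hypothetical.
const PIPELINE_RE = /- \[[ x]\] (https?:\/\/\S+)/g;
const APPLICATIONS_RE = /https?:\/\/[^\s|)]+/g;

// Made-up sample content.
const pipelineText = [
  '- [ ] https://example.com/jobs/10',
  '- [x] https://example.com/jobs/11 | Acme | Engineer',
  'Plain note with https://example.com/ignored (no checkbox)',
].join('\n');

const applicationsText =
  '| 1 | Acme | Engineer | [posting](https://example.com/jobs/12) |';

// The pipeline pattern captures the URL in group 1; the applications pattern
// uses the whole match and stops at whitespace, a pipe, or a closing parenthesis.
const fromPipeline = [...pipelineText.matchAll(PIPELINE_RE)].map(m => m[1]);
const fromApplications = [...applicationsText.matchAll(APPLICATIONS_RE)].map(m => m[0]);

console.log(fromPipeline);     // [ 'https://example.com/jobs/10', 'https://example.com/jobs/11' ]
console.log(fromApplications); // [ 'https://example.com/jobs/12' ]
```

Note that the checkbox pattern accepts only a space or a lowercase x between the brackets, so a line written as - [X] would not be picked up by it.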
The Deduplication Flow During Scanning
Once the sets are built, the main provider loop iterates over every returned job. Each candidate passes through two filters before it is considered new.
URL-Level Deduplication
For every job object returned by a provider, scan.mjs checks if (seenUrls.has(job.url)). If the URL is already in the set, the job is counted as a duplicate with totalDupes++ and immediately skipped via continue:
for (const job of jobs) {
  // …title & location filters omitted…
  // 1️⃣ URL deduplication
  if (seenUrls.has(job.url)) {
    totalDupes++;
    continue; // skip this job
  }
  // 2️⃣ Company-role deduplication
  const key = `${job.company.toLowerCase()}::${job.title.toLowerCase()}`;
  if (seenCompanyRoles.has(key)) {
    totalDupes++;
    continue; // skip this job
  }
  // Mark as seen immediately to avoid intra-scan repeats
  seenUrls.add(job.url);
  seenCompanyRoles.add(key);
  newOffers.push({ ...job, source: sourceName });
}
The URL check is the first gate in the deduplication pipeline: a job with an exact URL match is rejected before its company-role key is even built.
Company-Role Deduplication
A secondary guard prevents the same role at the same company from being added under a different URL. A separate loadSeenCompanyRoles() function parses data/applications.md to build a set of company::role keys. Inside the loop, scan.mjs constructs a lowercase key from job.company and job.title and checks it against seenCompanyRoles. This catches duplicate postings that might have unique tracking URLs or referral parameters.
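The behaviour of the key is easiest to see with a couple of inputs. The sketch below wraps the template string from the scan loop in a hypothetical helper, companyRoleKey, and uses made-up jobs; it assumes the key is only lowercased, as in the loop shown above, with no further normalisation.

```javascript
// Hypothetical helper mirroring the key built inside the scan loop.
function companyRoleKey(job) {
  return `${job.company.toLowerCase()}::${job.title.toLowerCase()}`;
}

// Pretend this role was already recorded in a previous run.
const seenCompanyRoles = new Set([
  companyRoleKey({ company: 'Acme', title: 'Backend Engineer' }),
]);

// Same role, different casing and a referral URL: the key matches.
const sameRoleNewUrl = {
  company: 'ACME',
  title: 'backend engineer',
  url: 'https://example.com/jobs/1?ref=abc',
};

// Same role with a reworded title: the key differs.
const rewordedTitle = {
  company: 'Acme',
  title: 'Backend Engineer (Remote)',
  url: 'https://example.com/jobs/2',
};

console.log(seenCompanyRoles.has(companyRoleKey(sameRoleNewUrl))); // true
console.log(seenCompanyRoles.has(companyRoleKey(rewordedTitle)));  // false
```

The second case shows the limit of this guard: it catches differences in casing and in URL, but a reworded title produces a new key and is treated as a new role.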
In-Scan Bookkeeping
Notice that the loop immediately updates both sets as soon as a job clears the filters:
// Mark as seen immediately to avoid intra-scan repeats
seenUrls.add(job.url);
seenCompanyRoles.add(key);
newOffers.push({ ...job, source: sourceName });
This in-scan bookkeeping prevents two different providers from introducing the same posting within a single execution.
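A small simulation makes the effect visible. This is a sketch under stated assumptions: collectNewOffers is a hypothetical stand-in for the real provider loop, provider results are plain arrays, the two checks are merged into one condition for brevity, and all sample jobs are invented.

```javascript
// Hypothetical stand-in for the scan loop; not the actual scan.mjs code.
function collectNewOffers(providers, seenUrls, seenCompanyRoles) {
  const newOffers = [];
  let totalDupes = 0;
  for (const [sourceName, jobs] of Object.entries(providers)) {
    for (const job of jobs) {
      const key = `${job.company.toLowerCase()}::${job.title.toLowerCase()}`;
      if (seenUrls.has(job.url) || seenCompanyRoles.has(key)) {
        totalDupes++;
        continue;
      }
      // Mark as seen right away so later providers in this run are checked against it.
      seenUrls.add(job.url);
      seenCompanyRoles.add(key);
      newOffers.push({ ...job, source: sourceName });
    }
  }
  return { newOffers, totalDupes };
}

const providers = {
  portalA: [{ url: 'https://example.com/jobs/1', company: 'Acme', title: 'Engineer' }],
  portalB: [
    { url: 'https://example.com/jobs/1', company: 'Acme', title: 'Engineer' },       // same URL
    { url: 'https://example.com/jobs/1?src=b', company: 'Acme', title: 'Engineer' }, // same role, new URL
    { url: 'https://example.com/jobs/2', company: 'Acme', title: 'Designer' },       // genuinely new
  ],
};

// Both sets start empty, so every rejection below comes from in-scan bookkeeping.
const result = collectNewOffers(providers, new Set(), new Set());
console.log(result.newOffers.map(o => `${o.source}:${o.url}`));
// [ 'portalA:https://example.com/jobs/1', 'portalB:https://example.com/jobs/2' ]
console.log(result.totalDupes); // 2
```

Without the two add() calls, all four jobs would pass, because nothing in the empty starting sets would match them.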
Persisting New URLs to scan-history.tsv
After a new offer passes verification and filtering, its metadata is permanently recorded so future runs will recognize it. The appendToScanHistory() function in scan.mjs appends each accepted job to data/scan-history.tsv, creating the file with a TSV header if it does not already exist:
function appendToScanHistory(offers, date, status = 'added') {
  if (!existsSync(SCAN_HISTORY_PATH)) {
    writeFileSync(
      SCAN_HISTORY_PATH,
      'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation\n',
      'utf-8'
    );
  }
  const lines = offers.map(o =>
    `${o.url}\t${date}\t${o.source}\t${o.title}\t${o.company}\t${status}\t${o.location || ''}`
  ).join('\n') + '\n';
  appendFileSync(SCAN_HISTORY_PATH, lines, 'utf-8');
}
By writing the URL into the first column of scan-history.tsv, the system ensures that loadSeenUrls() will include it in the deduplication set on the next execution.
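The write-then-read round trip can be checked without touching disk. In this sketch, toHistoryRows is a hypothetical helper that reuses the row template from appendToScanHistory(), and the offer and date are made up; the read-back step mirrors the first-column logic of loadSeenUrls().

```javascript
// Hypothetical helper: formats rows the way appendToScanHistory() does, but returns a string.
function toHistoryRows(offers, date, status = 'added') {
  return offers.map(o =>
    `${o.url}\t${date}\t${o.source}\t${o.title}\t${o.company}\t${status}\t${o.location || ''}`
  ).join('\n') + '\n';
}

const header = 'url\tfirst_seen\tportal\ttitle\tcompany\tstatus\tlocation\n';
const body = toHistoryRows(
  [{ url: 'https://example.com/jobs/1', source: 'demo', title: 'Engineer', company: 'Acme' }],
  '2024-01-01'
);

// Read back the first column, as loadSeenUrls() would on the next run.
const reloaded = new Set(
  (header + body)
    .split('\n')
    .slice(1)                          // skip header
    .map(line => line.split('\t')[0])  // first column = URL
    .filter(Boolean)                   // drop the empty trailing line
);

console.log(body.split('\t').length);                    // 7 columns, location left empty
console.log(reloaded.has('https://example.com/jobs/1')); // true
```

Because the URL is always the first field, it stays recoverable even when a later field such as location is empty.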
Summary
- loadSeenUrls() seeds a Set with URLs from data/scan-history.tsv, data/pipeline.md, and data/applications.md to establish a unified view of all previously seen jobs.
- The main scan loop uses seenUrls.has(job.url) to block exact URL duplicates before they reach the pipeline.
- A secondary seenCompanyRoles.has(key) check prevents the same company-role pair from being added under a different link.
- In-scan updates to both sets stop duplicate providers from introducing the same posting within a single execution.
- Verified offers are persisted back to data/scan-history.tsv via appendToScanHistory(), closing the loop for future deduplication.
Frequently Asked Questions
How does scan.mjs know which column contains the URL in scan-history.tsv?
loadSeenUrls() treats the first column as the URL by splitting each line on tabs and reading index 0. This matches the TSV header written by appendToScanHistory(), which defines url as the initial field.
What happens if scan-history.tsv does not exist yet?
If data/scan-history.tsv is missing, loadSeenUrls() skips the history load and returns a set containing only URLs found in pipeline.md and applications.md. When the first verified offers are processed, appendToScanHistory() creates the file and writes the tab-separated header row automatically.
Why does scan.mjs check pipeline.md and applications.md in addition to scan-history.tsv?
Manual edits to pipeline.md or applications.md might add jobs that have never been through an automated scan. By ingesting URLs from all three files, scan.mjs guarantees deduplication against the entire system state rather than just its own execution history.
Can two different job URLs for the same role at the same company both enter the pipeline?
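Not when the company and title strings match once lowercased. The company::role deduplication key, built from lowercase job.company and job.title, blocks the second posting even if its URL is unique. This is enforced by the seenCompanyRoles set, which is populated from data/applications.md and extended during the scan. A posting whose title is worded differently produces a different key and is treated as a separate role.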