How Pipeline Integrity Verification Detects Duplicates and Normalizes Application Statuses in Career-Ops

June 10, 2026 | santifer/career-ops

The dedup-tracker.mjs script identifies duplicate job applications by grouping entries using normalized company names and fuzzy role matching, then consolidates them by selecting the highest-scoring entry and promoting its status to the most advanced stage found across all duplicates.

The santifer/career-ops repository maintains a single source of truth for job tracking in applications.md. The pipeline integrity verification process ensures data consistency by programmatically detecting duplicate entries and normalizing their statuses through a deterministic scoring and ranking system.

Normalizing Company Names for Exact Grouping

The verification process begins by standardizing company names to create canonical keys for grouping. In dedup-tracker.mjs, the normalizeCompany() function (lines 48‑51) transforms raw employer strings into comparable identifiers:

function normalizeCompany(name) {
  return name.toLowerCase()
    .replace(/[()]/g, '')
    .replace(/\s+/g, ' ')
    .replace(/[^a-z0-9 ]/g, '')
    .trim();
}

This normalization converts text to lowercase, strips parentheses, collapses multiple whitespace characters, removes non-alphanumeric symbols, and trims edges. The resulting string serves as the key for a Map object that aggregates all entries belonging to the same employer.
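As a rough illustration of the grouping step, the sketch below reuses normalizeCompany() exactly as quoted and aggregates a few made-up rows under their normalized key. The entries array, its field names, and the company names are hypothetical; the real script reads its rows from applications.md.

```javascript
// normalizeCompany() is copied from the excerpt above; everything else is illustrative.
function normalizeCompany(name) {
  return name.toLowerCase()
    .replace(/[()]/g, '')
    .replace(/\s+/g, ' ')
    .replace(/[^a-z0-9 ]/g, '')
    .trim();
}

// Hypothetical tracker rows; the real applications.md columns may differ.
const entries = [
  { line: 3, company: 'Acme, Inc.' },
  { line: 7, company: 'ACME Inc' },
  { line: 9, company: 'Globex (EU)' },
];

// Aggregate row numbers under the normalized company key.
const groups = new Map();
for (const entry of entries) {
  const key = normalizeCompany(entry.company);
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(entry.line);
}

console.log(groups.get('acme inc'));  // [ 3, 7 ]
console.log(groups.get('globex eu')); // [ 9 ]
```

One detail worth knowing when reading the function as quoted: whitespace is collapsed before symbols are removed, so a name such as 'Foo & Bar' normalizes to 'foo  bar' (two spaces) and would not share a key with 'Foo Bar'.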

Normalizing Role Titles and Removing Stop-Words

After grouping by company, the script normalizes role titles to enable fuzzy comparison. The normalizeRole() function performs similar cleaning but preserves forward slashes and converts parentheses to spaces:

function normalizeRole(role) {
  return role.toLowerCase()
    .replace(/[()]/g, ' ')
    .replace(/\s+/g, ' ')
    .replace(/[^a-z0-9 /]/g, '')
    .trim();
}

The script then removes semantic noise using stop-word sets defined at lines 68‑80. ROLE_STOPWORDS and LOCATION_STOPWORDS contain terms like “senior,” “remote,” “engineer,” and geographic identifiers that don’t distinguish between otherwise similar positions. With those terms filtered out, titles are compared only on their remaining core descriptors: “Senior Backend Platform Engineer (Remote)” and “Backend Platform Engineer” both reduce to “backend platform” and can match, whereas a title that reduces to a single word cannot satisfy the two-word overlap that roleMatch() requires.
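To make the filtering concrete, here is a minimal sketch of which tokens survive. normalizeRole() is copied from the excerpt above; the two sets are hypothetical stand-ins containing only the terms named in the text (the real sets are longer, and which set holds which term is an assumption), and coreTokens() is a helper invented for this example.

```javascript
// normalizeRole() is copied from the excerpt above.
function normalizeRole(role) {
  return role.toLowerCase()
    .replace(/[()]/g, ' ')
    .replace(/\s+/g, ' ')
    .replace(/[^a-z0-9 /]/g, '')
    .trim();
}

// Hypothetical stand-ins for the real stop-word sets.
const ROLE_STOPWORDS = new Set(['senior', 'engineer']);
const LOCATION_STOPWORDS = new Set(['remote']);

// Hypothetical helper: tokenize, drop short words, then drop stop-words.
function coreTokens(role) {
  return normalizeRole(role)
    .split(/\s+/)
    .filter(w => w.length > 2)
    .filter(w => !ROLE_STOPWORDS.has(w) && !LOCATION_STOPWORDS.has(w));
}

console.log(coreTokens('Senior Data Platform Engineer (Remote)')); // [ 'data', 'platform' ]
console.log(coreTokens('Software Engineer'));                      // [ 'software' ]
```

Under these assumed sets, the second title keeps only one token, so it could never reach a two-word overlap with any other title.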

Fuzzy Role Matching Algorithm

The roleMatch() function (lines 82‑95) implements a token-based similarity algorithm to determine if two role titles describe the same position:

function roleMatch(a, b) {
  const filterStopwords = (words) =>
    words.filter(w => !ROLE_STOPWORDS.has(w) && !LOCATION_STOPWORDS.has(w));

  const wordsA = filterStopwords(normalizeRole(a).split(/\s+/).filter(w => w.length > 2));
  const wordsB = filterStopwords(normalizeRole(b).split(/\s+/).filter(w => w.length > 2));

  if (wordsA.length === 0 || wordsB.length === 0) return false;

  const overlap = wordsA.filter(w => wordsB.some(wb => wb === w));
  const smaller = Math.min(wordsA.length, wordsB.length);
  const ratio = overlap.length / smaller;

  return overlap.length >= 2 && ratio >= 0.6;
}

The algorithm first filters out stop-words and drops words shorter than three characters. It then calculates the intersection between the two word sets. For a match to occur, the titles must share at least two common words and achieve a minimum overlap ratio of 60 % relative to the smaller word set.
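The threshold arithmetic is easier to follow on already-filtered token lists. The sketch below isolates the last few lines of roleMatch() into a hypothetical helper, overlapStats(), and runs it on made-up token arrays; it is an illustration of the rule, not the script's code.

```javascript
// Hypothetical helper mirroring the overlap test at the end of roleMatch().
// Inputs are token arrays that have already been normalized and filtered.
function overlapStats(wordsA, wordsB) {
  const overlap = wordsA.filter(w => wordsB.includes(w));
  const smaller = Math.min(wordsA.length, wordsB.length);
  const ratio = smaller === 0 ? 0 : overlap.length / smaller;
  return {
    shared: overlap.length,
    ratio,
    match: overlap.length >= 2 && ratio >= 0.6,
  };
}

// 2 shared words, smaller set has 2 words: ratio 1.0, so this is a match.
console.log(overlapStats(['backend', 'platform', 'payments'], ['backend', 'platform']).match); // true

// 2 shared words, but both sets have 4 words: ratio 0.5, below the 0.6 threshold.
console.log(overlapStats(['data', 'platform', 'analytics', 'tooling'],
                         ['data', 'platform', 'machine', 'learning']).match); // false

// Ratio is 1.0, but only 1 shared word: fails the two-word minimum.
console.log(overlapStats(['frontend'], ['frontend']).match); // false
```

Dividing by the smaller set means a short title can match a longer one that contains all of its words, while the two-word minimum keeps single-word coincidences from merging unrelated roles.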

Grouping Entries and Selecting the Keeper

Once company groups are established, the script iterates through each group to build clusters of role-matching entries. Within each cluster, it selects a single keeper entry by numeric score: the parseScore function (lines 98‑101) extracts a score from each entry, and the clustering logic sorts the matches so that the highest-scoring entry comes first. The keeper is therefore the best-scored record for that position, which is not necessarily the most recent or most complete one.
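Keeper selection can be sketched as a sort by parsed score. The article does not show parseScore() or the score format, so the version below is a hypothetical stand-in that assumes scores are written like "4.5/5" and treats anything unparseable as 0; the cluster rows are made up.

```javascript
// Hypothetical parser: assumes scores look like "4.5/5"; the real parseScore() may differ.
function parseScore(text) {
  const m = String(text).match(/(\d+(?:\.\d+)?)/);
  return m ? parseFloat(m[1]) : 0;
}

// Made-up cluster of rows that matched on company and role.
const cluster = [
  { line: 12, score: '3.8/5' },
  { line: 27, score: '4.5/5' },
  { line: 31, score: 'N/A' },
];

// Sort a copy, highest score first; the first element becomes the keeper.
const ranked = [...cluster].sort((a, b) => parseScore(b.score) - parseScore(a.score));
const [keeper, ...duplicates] = ranked;

console.log(keeper.line);                  // 27
console.log(duplicates.map(d => d.line));  // [ 12, 31 ]
```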

Status Normalization Strategy

Career-Ops defines a hierarchical progression of application stages in the STATUS_RANK mapping, where higher numeric values represent more advanced pipeline steps. When processing a duplicate cluster, the script performs three normalization steps:

1. Rank Discovery: The algorithm records the highest status rank among all entries in the cluster (bestStatusRank).
2. Status Promotion: If any discarded duplicate has a more advanced status than the keeper, the keeper’s status is upgraded to that higher rank.
3. In-Place Update: The status field of the keeper row is rewritten directly in the lines array.

This guarantees that the surviving entry reflects the furthest progress achieved across all duplicate submissions.
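The promotion rule can be sketched in a few lines. The rank values and status labels below are hypothetical placeholders (the real labels come from templates/states.yml), and promoteStatus() is a helper invented for this example rather than a function from the script.

```javascript
// Hypothetical ranks; higher numbers mean further along in the pipeline.
const STATUS_RANK = { evaluated: 1, applied: 2, interview: 3, offer: 4 };

// Hypothetical helper: return the keeper with the most advanced status in the cluster.
function promoteStatus(keeper, duplicates) {
  let bestStatus = keeper.status;
  let bestStatusRank = STATUS_RANK[keeper.status] ?? 0; // unknown labels rank lowest
  for (const dup of duplicates) {
    const rank = STATUS_RANK[dup.status] ?? 0;
    if (rank > bestStatusRank) {
      bestStatusRank = rank;
      bestStatus = dup.status;
    }
  }
  return { ...keeper, status: bestStatus };
}

const promoted = promoteStatus(
  { line: 27, status: 'applied' },
  [{ line: 12, status: 'interview' }, { line: 31, status: 'evaluated' }],
);

console.log(promoted.status); // interview
```

The keeper's status only ever moves up: if no duplicate outranks it, the helper returns the keeper's original status unchanged.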

Removing Duplicate Entries from the Tracker

After identifying clusters and selecting keepers, the script collects the line indices of all duplicate rows in the linesToRemove array and splices them out of the master lines array. When run without the --dry-run flag, the script creates a backup at applications.md.bak before writing the cleaned data back to the original file.
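A minimal sketch of the removal step follows, assuming linesToRemove holds zero-based indices into the lines array. The removeLines() helper and its dryRun option are invented for this example, and the backup and file write are deliberately left out so the snippet has no side effects.

```javascript
// Hypothetical helper: remove the given indices from a copy of the lines array.
function removeLines(lines, linesToRemove, { dryRun = false } = {}) {
  // De-duplicate, then process from highest to lowest index so that
  // each splice leaves the positions of the remaining targets untouched.
  const sorted = [...new Set(linesToRemove)].sort((a, b) => b - a);
  if (dryRun) return { lines, removed: sorted.length };

  const copy = [...lines];
  for (const idx of sorted) copy.splice(idx, 1);
  return { lines: copy, removed: sorted.length };
}

// Made-up tracker rows.
const lines = ['| header |', '| row A |', '| row A dup |', '| row B |', '| row B dup |'];

console.log(removeLines(lines, [2, 4]).lines);
// [ '| header |', '| row A |', '| row B |' ]

console.log(removeLines(lines, [2, 4], { dryRun: true }).lines.length); // 5
```

Splicing in ascending order would shift every later index by one after each removal, which is why the sketch sorts the indices in descending order first.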

Running the Verification Pipeline

The integrity check can be invoked standalone or as part of the broader doctor.mjs orchestration. Use the following commands to execute the deduplication process:


# Preview changes without modifying the file
node dedup-tracker.mjs --dry-run

# Apply changes with automatic backup
node dedup-tracker.mjs

Programmatic invocation from another Node.js module:

import { execSync } from 'child_process';

// Run verification as part of a larger workflow
execSync('node dedup-tracker.mjs --dry-run', { stdio: 'inherit' });

The canonical status labels referenced by STATUS_RANK are defined in templates/states.yml, while doctor.mjs serves as the primary entry point for running comprehensive pipeline integrity checks that include this deduplication routine.

Summary

Company normalization uses normalizeCompany() to create canonical Map keys by lowercasing, collapsing whitespace, and removing non-alphanumeric characters.
Role normalization via normalizeRole() and stop-word filters prepares titles for fuzzy comparison by removing semantic noise like seniority levels and locations.
Fuzzy matching treats two roles as the same position only when they share at least two words and the overlap is at least 60 % of the smaller word set.
Keeper selection prioritizes the entry with the highest numeric score extracted by parseScore().
Status promotion utilizes STATUS_RANK to upgrade the keeper’s status to the most advanced stage found among its duplicates.
Safe deletion removes duplicate lines after creating a backup file, with --dry-run support for safe testing.

Frequently Asked Questions

How does the script determine if two role titles are duplicates?

The roleMatch() function normalizes both titles, removes stop-words, and filters out words shorter than three characters. It then counts the overlapping words between the two sets. If there are at least two common words and the overlap ratio is 60 % or higher relative to the smaller set, the roles are considered duplicates.

What happens to the status of a duplicate entry that is further along in the hiring process?

The script examines the STATUS_RANK of all entries in a duplicate cluster. If any discarded entry has a higher rank (more advanced status) than the keeper, the keeper’s status is automatically promoted to that higher rank. This ensures the tracker reflects the furthest pipeline progress achieved for that application.

Can I preview changes before the script modifies applications.md?

Yes. Running node dedup-tracker.mjs --dry-run executes the full detection and normalization logic without writing changes to disk. The script outputs the analysis to the console, showing which entries would be removed and which statuses would be promoted, allowing you to verify the integrity check before committing changes.

Where are the status rankings defined in the repository?

The canonical status labels and their hierarchical rankings are defined in templates/states.yml. The STATUS_RANK map in dedup-tracker.mjs references these definitions to determine which stage represents further progress in the application pipeline.
