Understanding the Plugin Architecture of local-parser.mjs for Custom Job Site Scraping

The local-parser.mjs module implements a provider-based plugin architecture that executes external scripts on the local machine to scrape job listings, avoiding browser-automation overhead while maintaining a standardized JSON interface.

The Career-Ops scanner leverages this architecture to enable extensible job site scraping without modifying core source code. By treating local executables—written in Python, Node.js, Bash, or any CLI-capable language—as first-class providers, the system bypasses costly Playwright browser automation for sites that expose structured data endpoints.

Provider Contract and Type Safety

All providers must conform to the Provider type defined in providers/_types.js. This contract enforces a consistent interface requiring two methods:

  • detect(entry): Determines if the provider can handle a specific portal configuration
  • fetch(entry): Returns a promise resolving to an array of normalized job listings

This shared contract lets local-parser.mjs plug into the provider selection logic in scan.mjs, so the scanner treats local scripts the same way as network-based providers.

Configuration-Driven Script Binding

The plugin architecture relies on declarative configuration within portals.yml. Each portal entry can specify a parser block that binds an external script to a specific careers page:

- name: ExampleCo
  careers_url: https://example.com/careers
  parser:
    command: python3
    script: ./parsers/exampleco.py
    args: ['--url', '{careers_url}']
    timeout_ms: 20000
    max_buffer_bytes: 1000000

The command field specifies the executable interpreter, while script provides the optional path to the parser file. Placeholders like {careers_url} and {company} are dynamically expanded by expandParserArg (lines 13-17) before execution, allowing context-specific arguments to be passed to external scripts.
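
The substitution step can be illustrated with a short Python sketch. This is not the module's JavaScript source: the function name expand_parser_arg, the mapping of {company} to the entry's name field, and the rule that unknown placeholders are left untouched are all assumptions made for illustration. Only the {careers_url} and {company} placeholders come from the text above.

```python
import re

# Hypothetical mapping from placeholder name to portal-entry key.
# Mapping {company} to the entry's "name" field is an assumption.
PLACEHOLDERS = {"careers_url": "careers_url", "company": "name"}


def expand_parser_arg(arg, entry):
    """Replace {placeholder} tokens in one argument with entry values."""
    def substitute(match):
        key = PLACEHOLDERS.get(match.group(1))
        # Leave unknown or unset placeholders untouched (assumed behaviour).
        if key is None or key not in entry:
            return match.group(0)
        return str(entry[key])

    return re.sub(r"\{(\w+)\}", substitute, arg)


entry = {"name": "ExampleCo", "careers_url": "https://example.com/careers"}
expanded = [expand_parser_arg(a, entry) for a in ["--url", "{careers_url}"]]
print(expanded)
```

Expanding each argument separately, rather than a single command string, keeps every value a distinct argv element, which is the usual reason for pairing this kind of templating with an execFile-style call instead of a shell.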

Detection and Execution Pipeline

The execution flow follows a three-phase pattern: detection, command execution, and result normalization.

Provider Detection Logic

The detect(entry) method (lines 10-16) validates that a portal entry should use local parsing by checking two conditions: the presence of parser.command in the configuration and the existence of the referenced script file on disk. When successful, it returns either entry.careers_url or the literal string "local-parser", signaling to scan.mjs that this provider should handle the entry.
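
The two checks can be sketched in Python as follows. This is an illustration of the described behaviour rather than the module's code. In particular, treating a missing script path as acceptable (because the field is described as optional) is an assumption.

```python
import os


def detect(entry):
    """Return a non-None marker when the entry should use the local parser."""
    parser = entry.get("parser") or {}
    if not parser.get("command"):
        return None  # no parser.command configured
    script = parser.get("script")
    # Assumption: the script path may be omitted, but a configured
    # path must point at an existing file.
    if script and not os.path.isfile(script):
        return None
    return entry.get("careers_url") or "local-parser"


# An entry without a parser block is not claimed by this provider.
print(detect({"name": "NoParser", "careers_url": "https://example.com"}))
```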

Command Execution and Process Management

Upon provider selection, fetch(entry) (lines 18-20) delegates to runLocalParser(entry). This function orchestrates the external execution through three steps:

  1. Argument Construction: buildParserArgs(entry) (lines 32-40) constructs the command-line array, expanding all placeholders via the template system.
  2. Process Spawning: The promisified execFileAsync wrapper executes the command (lines 82-86), enforcing the timeout_ms and max_buffer_bytes limits specified in the configuration.
  3. Output Capture: STDOUT is buffered and parsed as JSON, with STDERR reserved for logging and debugging.
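
The three steps above can be mirrored in a small Python harness, which is also handy for trying a parser script before wiring it into portals.yml. It is a sketch under stated assumptions: run_local_parser here is a hypothetical Python analogue, not the JavaScript function of the same purpose, and the output-size check runs after the process exits because Python's subprocess module has no direct equivalent of an output buffer cap.

```python
import json
import subprocess
import sys


def run_local_parser(command, args, timeout_ms=20000, max_buffer_bytes=1_000_000):
    """Run a parser command and return its STDOUT parsed as JSON."""
    completed = subprocess.run(
        [command, *args],
        capture_output=True,
        text=True,
        timeout=timeout_ms / 1000,  # subprocess expects seconds
        check=True,                 # non-zero exit raises CalledProcessError
    )
    # Emulate an output-size cap after the fact (assumed approximation).
    if len(completed.stdout.encode("utf-8")) > max_buffer_bytes:
        raise ValueError("parser output exceeded max_buffer_bytes")
    if completed.stderr:
        print(completed.stderr, file=sys.stderr)  # diagnostics only
    return json.loads(completed.stdout)


# Stand-in parser: an inline Python one-liner that prints a jobs payload.
demo_script = (
    'import json; '
    'print(json.dumps({"jobs": [{"title": "T", "url": "https://example.com/job/1"}]}))'
)
payload = run_local_parser(sys.executable, ["-c", demo_script])
print(payload["jobs"][0]["title"])
```

A script that overruns the timeout raises subprocess.TimeoutExpired in this harness, which makes slow parsers easy to spot before the scanner runs them.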

Result Normalization

Raw script output undergoes validation by normalizeParserJob (lines 58-74). The function expects a JSON payload containing either payload.jobs or payload.results, then transforms each entry into a standardized schema:

{
  "title": "Senior Engineer",
  "url": "https://example.com/job/123",
  "company": "Acme Corp",
  "location": "Berlin, Germany"
}

Entries missing either required field (title or url) are filtered out during normalization (lines 66-73), ensuring data consistency across heterogeneous parser implementations.
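
A parser author can apply the same filtering rule to a script's output ahead of time. The sketch below is a hypothetical Python approximation of the normalization described above. Trimming whitespace, defaulting location to an empty string, and the default_company fallback are assumptions, not documented behaviour.

```python
def normalize_parser_jobs(payload, default_company=""):
    """Keep entries that have a title and url; map them onto the common schema."""
    if not isinstance(payload, dict):
        return []
    raw = payload.get("jobs") or payload.get("results") or []
    jobs = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        title = str(item.get("title") or "").strip()
        url = str(item.get("url") or "").strip()
        if not title or not url:
            continue  # drop entries missing a required field
        jobs.append({
            "title": title,
            "url": url,
            "company": item.get("company") or default_company,
            "location": item.get("location") or "",
        })
    return jobs


sample = {
    "results": [
        {"title": "Senior Engineer", "url": "https://example.com/job/123", "location": "Berlin, Germany"},
        {"title": "No link"},  # filtered out: url is missing
    ]
}
normalized = normalize_parser_jobs(sample, default_company="Acme Corp")
print(len(normalized))
```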

Building a Custom Parser Script

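Creating a new parser requires only a command-line script that outputs valid JSON to STDOUT. The following Python example demonstrates a minimal implementation for a careers endpoint that returns a JSON array of postings. The field names title, apply_url, and location are illustrative; a real job board API (Greenhouse, for example) has its own response shape, so adjust the mapping accordingly.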

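import argparse
import json

import requests  # third-party: pip install requests

# The provider passes the expanded careers URL via --url (see args in portals.yml).
parser = argparse.ArgumentParser()
parser.add_argument('--url', required=True)
args = parser.parse_args()

# Stay under the provider's timeout_ms and fail loudly on HTTP errors.
resp = requests.get(args.url, timeout=15)
resp.raise_for_status()
data = resp.json()  # assumed to be a JSON array of postings

# Map each posting onto the fields the provider normalizes.
jobs = [
    {"title": j["title"], "url": j["apply_url"], "company": "ExampleCo", "location": j.get("location", "")}
    for j in data
]
# Print only the JSON payload to STDOUT; send any diagnostics to STDERR.
print(json.dumps({"jobs": jobs}))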

The script receives expanded arguments from buildParserArgs, queries the external API, and prints a JSON object to STDOUT. The local-parser.mjs provider handles process management, timeout enforcement, and schema validation automatically.

Integration with the Career-Ops Scanner

The scan.mjs orchestrator iterates through all configured providers, invoking detect() on each until one returns a non-null value. Because the local parser runs as a child process and returns structured JSON, it avoids the browser instantiation that Playwright-based scraping requires, although the script itself may still make network requests. Adding a new job source therefore takes only a parser script and a portals.yml entry; no modifications to scan.mjs or the core provider logic are necessary.
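
The first-match selection described here can be sketched in Python. Every name in this block (LocalParserProvider, FallbackProvider, select_provider) is hypothetical and stands in for the JavaScript providers. The real fetch() returns a promise, whereas this sketch returns a plain list.

```python
class LocalParserProvider:
    """Hypothetical stand-in for the provider exported by local-parser.mjs."""
    name = "local-parser"

    def detect(self, entry):
        parser = entry.get("parser") or {}
        if not parser.get("command"):
            return None
        return entry.get("careers_url") or "local-parser"

    def fetch(self, entry):
        return []  # the real provider would run the script here


class FallbackProvider:
    """Hypothetical network-based provider that accepts any entry with a URL."""
    name = "fallback"

    def detect(self, entry):
        return entry.get("careers_url")

    def fetch(self, entry):
        return []


def select_provider(providers, entry):
    """Return the first provider whose detect() yields a non-None value."""
    for provider in providers:
        if provider.detect(entry) is not None:
            return provider
    return None


providers = [LocalParserProvider(), FallbackProvider()]
with_parser = {"careers_url": "https://example.com/careers", "parser": {"command": "python3"}}
without_parser = {"careers_url": "https://example.com/careers"}
print(select_provider(providers, with_parser).name)
print(select_provider(providers, without_parser).name)
```

Ordering matters in a first-match loop: placing the local parser ahead of more general providers is what lets a parser block in portals.yml take precedence in this sketch.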

Summary

  • The Provider contract in providers/_types.js requires implementing detect() and fetch() methods for all scraping sources.
  • Configuration-driven binding via portals.yml uses placeholder expansion through expandParserArg to pass context-aware arguments to external scripts.
  • Detection logic in detect() (lines 10-16) verifies that parser.command is configured and that the script file exists before marking a provider as eligible.
  • Process execution through runLocalParser() uses execFileAsync (lines 82-86) with configurable timeout and output-buffer limits to prevent runaway scripts from stalling the scan.
  • Schema normalization via normalizeParserJob (lines 58-74) enforces a uniform output format across all parser languages and implementations.

Frequently Asked Questions

What programming languages are supported for custom parsers in local-parser.mjs?

Any language capable of reading command-line arguments and printing JSON to STDOUT is supported. The command field in portals.yml names the interpreter or executable (e.g., python3, node, ruby, bash), so the provider itself is language-agnostic.

How does local-parser.mjs handle script failures and resource limits?

The provider passes timeout_ms and max_buffer_bytes configuration values directly to execFileAsync (lines 82-86). If a script exceeds the time limit or output buffer, the promise rejects and the provider returns an empty result set, allowing scan.mjs to continue with other providers without freezing the pipeline.

What exact JSON structure must my parser output to be compatible with local-parser.mjs?

Your script must print a JSON object to STDOUT containing either a jobs or results array. Each array element must include title and url fields; company and location are optional. The normalizeParserJob function (lines 58-74) validates these fields and filters out malformed entries before returning data to the scanner.

Do I need to modify core Career-Ops files to add a new job site parser?

No. Adding support for a new job site requires only creating the parser script and adding a corresponding entry in portals.yml. The provider selection logic in scan.mjs picks up the new entry through the detect() method interface, so the core code stays unchanged.
