Understanding the Plugin Architecture of local-parser.mjs for Custom Job Site Scraping
The local-parser.mjs module implements a provider-based plugin architecture that runs external scripts on the local machine to scrape job listings, avoiding browser-automation overhead while maintaining a standardized JSON interface.
The Career-Ops scanner leverages this architecture to enable extensible job site scraping without modifying core source code. By treating local executables—written in Python, Node.js, Bash, or any CLI-capable language—as first-class providers, the system bypasses costly Playwright browser automation for sites that expose structured data endpoints.
Provider Contract and Type Safety
All providers must conform to the Provider type defined in providers/_types.js. This contract enforces a consistent interface requiring two methods:
- detect(entry): Determines if the provider can handle a specific portal configuration
- fetch(entry): Returns a promise resolving to an array of normalized job listings
This strict typing ensures that local-parser.mjs integrates seamlessly with the provider selection logic in scan.mjs, allowing the scanner to treat local scripts identically to network-based providers.
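The shape of this contract can be sketched in Python for illustration. The real type lives in providers/_types.js and is JavaScript, so this is only an approximation: the names `Job`, `Provider`, and `StaticProvider` are hypothetical, and the async `fetch` stands in for the promise-returning method.

```python
import asyncio
from typing import List, Optional, Protocol, TypedDict, runtime_checkable


class Job(TypedDict, total=False):
    """Normalized listing; field names follow the schema described in this article."""
    title: str
    url: str
    company: str
    location: str


@runtime_checkable
class Provider(Protocol):
    """Illustrative Python rendering of the two-method contract."""

    def detect(self, entry: dict) -> Optional[str]:
        """Return a non-null marker if this provider can handle the entry."""
        ...

    async def fetch(self, entry: dict) -> List[Job]:
        """Return the normalized listings for the entry."""
        ...


class StaticProvider:
    """Hypothetical provider, used only to show a class satisfying the contract."""

    def detect(self, entry: dict) -> Optional[str]:
        # Eligible only when the entry carries inline jobs (an invented field).
        return entry.get("careers_url") if "static_jobs" in entry else None

    async def fetch(self, entry: dict) -> List[Job]:
        return list(entry["static_jobs"])


entry = {
    "careers_url": "https://example.com/careers",
    "static_jobs": [{"title": "Senior Engineer", "url": "https://example.com/job/123"}],
}
provider = StaticProvider()
jobs = asyncio.run(provider.fetch(entry))
```

Because the contract is structural, any object exposing both methods qualifies, which is what lets the scanner treat script-backed and network-backed providers alike.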
Configuration-Driven Script Binding
The plugin architecture relies on declarative configuration within portals.yml. Each portal entry can specify a parser block that binds an external script to a specific careers page:
- name: ExampleCo
  careers_url: https://example.com/careers
  parser:
    command: python3
    script: ./parsers/exampleco.py
    args: ['--url', '{careers_url}']
    timeout_ms: 20000
    max_buffer_bytes: 1000000
The command field specifies the executable interpreter, while script provides the optional path to the parser file. Placeholders like {careers_url} and {company} are dynamically expanded by expandParserArg (lines 13-17) before execution, allowing context-specific arguments to be passed to external scripts.
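The substitution step can be illustrated with a short Python sketch. It is not the module's code: the function name mirrors expandParserArg only for readability, the mapping of {company} to the entry's name field is an assumption, and leaving unknown placeholders untouched is likewise a guess at reasonable behavior.

```python
import re

# Matches tokens such as {careers_url} or {company}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def expand_parser_arg(arg: str, entry: dict) -> str:
    """Replace known {placeholder} tokens in one argument string."""
    context = {
        "careers_url": str(entry.get("careers_url", "")),
        # Assumption: {company} is filled from the portal entry's `name`.
        "company": str(entry.get("name", "")),
    }
    # Assumption: unknown placeholders are passed through unchanged.
    return PLACEHOLDER.sub(lambda m: context.get(m.group(1), m.group(0)), arg)


entry = {"name": "ExampleCo", "careers_url": "https://example.com/careers"}
args = [expand_parser_arg(a, entry) for a in ["--url", "{careers_url}", "--company={company}"]]
```

With the ExampleCo entry above, the configured args list expands to the literal URL, so the script never needs to read portals.yml itself.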
Detection and Execution Pipeline
The execution flow follows a three-phase pattern: detection, command execution, and result normalization.
Provider Detection Logic
The detect(entry) method (lines 10-16) validates that a portal entry should use local parsing by checking two conditions: the presence of parser.command in the configuration and the existence of the referenced script file on disk. When successful, it returns either entry.careers_url or the literal string "local-parser", signaling to scan.mjs that this provider should handle the entry.
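As a rough Python sketch of these two checks (the module itself is JavaScript, and its exact rules may differ): the version below assumes the script check is skipped when no script is configured, since the script field is described as optional, and that careers_url takes precedence over the "local-parser" fallback.

```python
import sys
from pathlib import Path
from typing import Optional


def detect(entry: dict) -> Optional[str]:
    """Return a non-null marker when the entry should use local parsing."""
    parser = entry.get("parser") or {}
    if not parser.get("command"):
        return None  # no parser.command configured
    script = parser.get("script")
    # Assumption: the on-disk check applies only when a script path is given.
    if script and not Path(script).is_file():
        return None
    # Assumption: prefer the careers URL, else the literal fallback string.
    return entry.get("careers_url") or "local-parser"


# sys.executable stands in for a parser script because it is a file that exists.
existing = sys.executable
eligible = detect({
    "careers_url": "https://example.com/careers",
    "parser": {"command": "python3", "script": existing},
})
no_url = detect({"parser": {"command": "python3", "script": existing}})
missing = detect({
    "careers_url": "https://example.com/careers",
    "parser": {"command": "python3", "script": "./parsers/does-not-exist.py"},
})
```

Returning None for a missing script lets the scanner move on to another provider rather than failing later when the process is spawned.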
Command Execution and Process Management
Upon provider selection, fetch(entry) (lines 18-20) delegates to runLocalParser(entry). This function orchestrates the external execution through three steps:
- Argument Construction: buildParserArgs(entry) (lines 32-40) constructs the command-line array, expanding all placeholders via the template system.
- Process Spawning: The promisified execFileAsync wrapper executes the command (lines 82-86), enforcing the timeout_ms and max_buffer_bytes limits specified in the configuration.
- Output Capture: STDOUT is buffered and parsed as JSON, with STDERR reserved for logging and debugging.
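The spawn-and-capture steps can be approximated in Python with subprocess.run. This is an analogue under stated assumptions, not the module's code: Python has no direct equivalent of a maximum output buffer, so the sketch checks the size after the process exits, and the function name and defaults are chosen here for illustration.

```python
import json
import subprocess
import sys


def run_local_parser(command, args, timeout_ms=20000, max_buffer_bytes=1_000_000):
    """Run a parser command and return its decoded JSON payload."""
    completed = subprocess.run(
        [command, *args],
        capture_output=True,
        timeout=timeout_ms / 1000,  # subprocess expects seconds
        check=True,                 # a non-zero exit raises CalledProcessError
    )
    # Assumption: emulate the buffer limit by checking size after the run.
    if len(completed.stdout) > max_buffer_bytes:
        raise ValueError("parser output exceeded max_buffer_bytes")
    if completed.stderr:
        # STDERR is treated as diagnostics only and is never parsed.
        sys.stderr.write(completed.stderr.decode(errors="replace"))
    return json.loads(completed.stdout)


# Usage: an inline stand-in "parser" that prints a one-job payload.
stub = (
    'import json; print(json.dumps({"jobs": [{"title": "Senior Engineer", '
    '"url": "https://example.com/job/123"}]}))'
)
payload = run_local_parser(sys.executable, ["-c", stub])
```

Passing the command and arguments as a list, rather than a shell string, avoids shell interpolation of expanded placeholders such as URLs.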
Result Normalization
Raw script output undergoes validation by normalizeParserJob (lines 58-74). The function expects a JSON payload containing either payload.jobs or payload.results, then transforms each entry into a standardized schema:
{
  "title": "Senior Engineer",
  "url": "https://example.com/job/123",
  "company": "Acme Corp",
  "location": "Berlin, Germany"
}
Entries missing required fields (title or url) are filtered out during normalization (lines 66-73), ensuring data consistency across heterogeneous parser implementations.
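The filtering rule can be sketched as follows. This Python version only illustrates the behavior described above; the whitespace trimming, the empty-string defaults for optional fields, and the `default_company` parameter are assumptions rather than details confirmed by the module.

```python
def normalize_parser_jobs(payload: dict, default_company: str = "") -> list:
    """Return listings in the standardized schema, dropping incomplete ones."""
    # Accept either top-level key, as described for the parser payload.
    raw = payload.get("jobs") or payload.get("results") or []
    if not isinstance(raw, list):
        return []
    normalized = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        title = str(item.get("title") or "").strip()
        url = str(item.get("url") or "").strip()
        if not title or not url:
            continue  # title and url are required
        normalized.append({
            "title": title,
            "url": url,
            # Assumption: optional fields fall back to defaults.
            "company": item.get("company") or default_company,
            "location": item.get("location") or "",
        })
    return normalized


payload = {"results": [
    {"title": "Senior Engineer", "url": "https://example.com/job/123", "location": "Berlin, Germany"},
    {"title": "No link"},                      # dropped: missing url
    {"url": "https://example.com/job/456"},    # dropped: missing title
]}
jobs = normalize_parser_jobs(payload, default_company="Acme Corp")
```

Only the first entry survives, and it picks up the default company because the script did not supply one.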
Building a Custom Parser Script
Creating a new parser requires only a command-line script that outputs valid JSON to STDOUT. The following Python example demonstrates a minimal implementation, assuming the careers URL returns a JSON array of postings with title, apply_url, and location fields:
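import argparse
import json

import requests

# Read the expanded --url argument passed in by the provider.
parser = argparse.ArgumentParser()
parser.add_argument('--url', required=True)
args = parser.parse_args()

# Fetch the listings; fail fast on HTTP errors or a hung connection.
resp = requests.get(args.url, timeout=15)
resp.raise_for_status()
data = resp.json()

# Map each posting to the fields the provider expects.
jobs = [
    {"title": j["title"], "url": j["apply_url"], "company": "ExampleCo", "location": j["location"]}
    for j in data
]
print(json.dumps({"jobs": jobs}))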
The script receives expanded arguments from buildParserArgs, queries the external API, and prints a JSON object to STDOUT. The local-parser.mjs provider handles process management, timeout enforcement, and schema validation automatically.
Integration with the Career-Ops Scanner
The scan.mjs orchestrator iterates through all configured providers, invoking detect() on each until one returns a non-null value. Because local-parser.mjs runs a script and reads structured JSON from its output, it avoids the browser instantiation that Playwright-based scraping requires. Adding a new job source therefore takes only a parser script and a portals.yml entry; no modifications to scan.mjs or the core provider logic are necessary.
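The first-match selection described here can be sketched in a few lines of Python. The two stub classes and the `select_provider` helper are hypothetical stand-ins; the actual loop in scan.mjs is JavaScript and may order or filter providers differently.

```python
class LocalParserStub:
    """Stand-in for the local parser: eligible when parser.command is set."""
    name = "local-parser"

    def detect(self, entry):
        parser = entry.get("parser") or {}
        return entry.get("careers_url") if parser.get("command") else None


class BrowserStub:
    """Stand-in for a browser-based provider that accepts any entry with a URL."""
    name = "browser"

    def detect(self, entry):
        return entry.get("careers_url")


def select_provider(entry, providers):
    """Return the first provider whose detect() yields a non-null value."""
    for provider in providers:
        if provider.detect(entry) is not None:
            return provider
    return None


providers = [LocalParserStub(), BrowserStub()]
with_parser = {"careers_url": "https://example.com/careers", "parser": {"command": "python3"}}
without_parser = {"careers_url": "https://example.com/careers"}

chosen_local = select_provider(with_parser, providers)
chosen_fallback = select_provider(without_parser, providers)
```

In this sketch, ordering matters: placing the local parser ahead of the general-purpose provider is what lets a configured script take over an entry.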
Summary
- The Provider contract in providers/_types.js requires implementing detect() and fetch() methods for all scraping sources.
- Configuration-driven binding via portals.yml uses placeholder expansion through expandParserArg to pass context-aware arguments to external scripts.
- Detection logic in detect() (lines 10-16) verifies that parser.command is configured and that the script file exists before marking a provider as eligible.
- Process execution through runLocalParser() uses execFileAsync (lines 82-86) with configurable timeout and output-buffer limits to prevent resource exhaustion.
- Schema normalization via normalizeParserJob (lines 58-74) enforces a uniform output format across all parser languages and implementations.
Frequently Asked Questions
What programming languages are supported for custom parsers in local-parser.mjs?
Any language capable of reading command-line arguments and printing JSON to STDOUT is supported. The command field in portals.yml specifies the interpreter (e.g., python3, node, ruby, bash), so the architecture is language-agnostic.
How does local-parser.mjs handle script failures and resource limits?
The provider passes timeout_ms and max_buffer_bytes configuration values directly to execFileAsync (lines 82-86). If a script exceeds the time limit or output buffer, the promise rejects and the provider returns an empty result set, allowing scan.mjs to continue with other providers without freezing the pipeline.
What exact JSON structure must my parser output to be compatible with local-parser.mjs?
Your script must print a JSON object to STDOUT containing either a jobs or results array. Each array element must include title and url fields; company and location are optional. The normalizeParserJob function (lines 58-74) validates these fields and filters out malformed entries before returning data to the scanner.
Do I need to modify core Career-Ops files to add a new job site parser?
No. Adding support for a new job site requires only creating the parser script and adding a corresponding entry in portals.yml. The provider selection logic in scan.mjs automatically discovers available providers through the detect() method interface, so new sources can be added without changing core code.