# Understanding the Plugin Architecture of local-parser.mjs for Custom Job Site Scraping

> Explore the plugin architecture of local-parser.mjs for custom job site scraping. Execute external scripts locally for efficient, standardized job listing data extraction.

- Repository: [Santiago Fernández de Valderrama/career-ops](https://github.com/santifer/career-ops)
- Tags: internals
- Published: 2026-06-07

---

**The `local-parser.mjs` module implements a provider-based plugin architecture that executes external scripts to scrape job listings locally, avoiding browser-automation overhead while maintaining a standardized JSON interface.**

The Career-Ops scanner uses this architecture to enable extensible job site scraping without modifying core source code. By treating local executables (written in Python, Node.js, Bash, or any other language that can run from the command line) as first-class providers, the system avoids costly Playwright browser automation for sites that expose structured data endpoints.

## Provider Contract and Type Safety

All providers must conform to the `Provider` type defined in [`providers/_types.js`](https://github.com/santifer/career-ops/blob/main/providers/_types.js). This contract enforces a consistent interface requiring two methods:

- **`detect(entry)`**: Determines if the provider can handle a specific portal configuration
- **`fetch(entry)`**: Returns a promise resolving to an array of normalized job listings

This shared contract lets `local-parser.mjs` plug into the provider selection logic in `scan.mjs`, so the scanner can treat local scripts identically to network-based providers.
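As a rough illustration of such a contract, the sketch below declares a JSDoc `Provider` shape and a trivial provider that satisfies it. The typedef body, the `staticProvider` object, and the `static_jobs` field are assumptions made for illustration; the actual definitions live in `providers/_types.js` and may differ.

```javascript
/**
 * Hypothetical shape of the contract; the real typedef is in providers/_types.js.
 * @typedef {Object} Provider
 * @property {(entry: object) => (string|null)} detect
 * @property {(entry: object) => Promise<object[]>} fetch
 */

/** @type {Provider} */
const staticProvider = {
  // Claim only entries that carry an inline `static_jobs` list (illustrative field).
  detect(entry) {
    return Array.isArray(entry.static_jobs) ? 'static' : null;
  },
  // Resolve to normalized listings; a real provider would do I/O here.
  async fetch(entry) {
    return entry.static_jobs.map((job) => ({
      title: job.title,
      url: job.url,
      company: entry.name,
      location: job.location ?? '',
    }));
  },
};
```

Any object with these two methods could be handed to the scanner's selection logic, which is what makes a script-backed provider interchangeable with a network-backed one.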

## Configuration-Driven Script Binding

The plugin architecture relies on declarative configuration within [`portals.yml`](https://github.com/santifer/career-ops/blob/main/portals.yml). Each portal entry can specify a `parser` block that binds an external script to a specific careers page:

```yaml
- name: ExampleCo
  careers_url: https://example.com/careers
  parser:
    command: python3
    script: ./parsers/exampleco.py
    args: ['--url', '{careers_url}']
    timeout_ms: 20000
    max_buffer_bytes: 1000000
```

The `command` field specifies the executable to run (typically an interpreter), while the optional `script` field gives the path to the parser file. Placeholders like `{careers_url}` and `{company}` are dynamically expanded by `expandParserArg` (lines 13-17) before execution, allowing context-specific arguments to be passed to external scripts.
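Placeholder expansion of this kind can be sketched in a few lines. The function below is a hypothetical re-implementation, not the project's `expandParserArg`: the name `expandArg`, the mapping of `{company}` to the entry's `name` field, and the choice to leave unknown placeholders untouched are all assumptions.

```javascript
// Hypothetical re-implementation of placeholder expansion; not the project's code.
function expandArg(arg, entry) {
  const values = new Map([
    ['careers_url', entry.careers_url ?? ''],
    ['company', entry.name ?? ''], // assumption: {company} comes from the entry's name
  ]);
  // Replace each known {placeholder}; leave unknown ones untouched.
  return String(arg).replace(/\{(\w+)\}/g, (match, key) =>
    values.has(key) ? values.get(key) : match
  );
}

const entry = { name: 'ExampleCo', careers_url: 'https://example.com/careers' };
const args = ['--url', '{careers_url}', '--label', '{company}-jobs', '{unknown}'];
console.log(args.map((arg) => expandArg(arg, entry)));
```

Expanding each array element separately, rather than building one shell string, keeps arguments with spaces or special characters intact when they are later passed to the child process.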

## Detection and Execution Pipeline

The execution flow follows a three-phase pattern: detection, command execution, and result normalization.

### Provider Detection Logic

The `detect(entry)` method (lines 10-16) validates that a portal entry should use local parsing by checking two conditions: the presence of `parser.command` in the configuration and the existence of the referenced script file on disk. When successful, it returns either `entry.careers_url` or the literal string `"local-parser"`, signaling to `scan.mjs` that this provider should handle the entry.
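The two checks described above might look roughly like the following. This is a sketch under stated assumptions: the `baseDir` parameter is hypothetical, the script check is applied only when a `script` is configured (since the field is optional), and the careers URL is assumed to take precedence over the `"local-parser"` fallback.

```javascript
import { existsSync } from 'node:fs';
import path from 'node:path';

// Sketch of the two detection checks; names and fallback order are assumptions.
function detect(entry, baseDir = process.cwd()) {
  const parser = entry.parser;
  if (!parser || typeof parser.command !== 'string' || parser.command === '') {
    return null; // no parser.command configured
  }
  if (parser.script && !existsSync(path.resolve(baseDir, parser.script))) {
    return null; // a script is configured but missing on disk
  }
  // Assumed order: prefer the careers URL, else the literal provider name.
  return entry.careers_url || 'local-parser';
}

console.log(detect({ name: 'ExampleCo', parser: { command: 'python3' } }));
```

Returning `null` for a missing script lets the scanner fall through to another provider instead of failing later when the process is spawned.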

### Command Execution and Process Management

Upon provider selection, `fetch(entry)` (lines 18-20) delegates to `runLocalParser(entry)`. This function orchestrates the external execution through three steps:

1. **Argument Construction**: `buildParserArgs(entry)` (lines 32-40) constructs the command-line array, expanding all placeholders via the template system.
2. **Process Spawning**: The promisified `execFileAsync` wrapper executes the command (lines 82-86), enforcing the `timeout_ms` and `max_buffer_bytes` limits specified in the configuration.
3. **Output Capture**: STDOUT is buffered and parsed as JSON, with STDERR reserved for logging and debugging.
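The spawning and capture steps above can be approximated with Node's built-in `execFile`, promisified via `util.promisify`. The function name `runParserCommand`, its default limits, and its error handling are assumptions for this sketch and are not taken from `local-parser.mjs`.

```javascript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Sketch only: function name, defaults, and error handling are assumptions.
async function runParserCommand(command, args, limits = {}) {
  const { timeout_ms = 20000, max_buffer_bytes = 1000000 } = limits;
  const { stdout, stderr } = await execFileAsync(command, args, {
    timeout: timeout_ms,         // the child is killed after this many milliseconds
    maxBuffer: max_buffer_bytes, // cap on bytes buffered from stdout or stderr
  });
  if (stderr) console.error(stderr.trim()); // diagnostics only, never parsed
  return JSON.parse(stdout);                // STDOUT must hold a single JSON document
}

// Usage: run an inline Node script that prints a payload.
runParserCommand(process.execPath, [
  '-e',
  'console.log(JSON.stringify({ jobs: [{ title: "Dev", url: "https://example.com/1" }] }))',
]).then((payload) => console.log(payload.jobs.length));
```

Because `execFile` takes the command and its arguments as separate values and does not start a shell by default, expanded placeholders are passed through verbatim rather than being interpreted as shell syntax.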

### Result Normalization

The parsed output is then validated and normalized. The provider expects a JSON payload containing either `payload.jobs` or `payload.results`, and `normalizeParserJob` (lines 58-74) transforms each entry into a standardized schema:

```json
{
  "title": "Senior Engineer",
  "url": "https://example.com/job/123",
  "company": "Acme Corp",
  "location": "Berlin, Germany"
}
```

Entries missing either required field (`title` or `url`) are filtered out during normalization (lines 66-73), ensuring data consistency across heterogeneous parser implementations.
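A minimal sketch of this normalization step follows. The helper names `normalizeJob` and `normalizePayload`, the whitespace trimming, and the fallback of `company` to the portal entry's `name` are assumptions; only the `jobs`/`results` keys and the required `title` and `url` fields come from the description above.

```javascript
// Hypothetical normalizer mirroring the behaviour described above.
function normalizeJob(raw, entry) {
  if (!raw || typeof raw !== 'object') return null;
  const title = typeof raw.title === 'string' ? raw.title.trim() : '';
  const url = typeof raw.url === 'string' ? raw.url.trim() : '';
  if (!title || !url) return null; // both fields are required
  return {
    title,
    url,
    company: raw.company || entry.name || '', // assumed fallback to the portal name
    location: raw.location || '',
  };
}

function normalizePayload(payload, entry) {
  // Accept either `jobs` or `results` as the list key.
  const list = Array.isArray(payload?.jobs) ? payload.jobs
    : Array.isArray(payload?.results) ? payload.results
    : [];
  return list.map((raw) => normalizeJob(raw, entry)).filter(Boolean);
}

const sample = { results: [{ title: 'Senior Engineer', url: 'https://example.com/job/123' }, { title: 'No URL' }] };
console.log(normalizePayload(sample, { name: 'Acme Corp' }));
```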

## Building a Custom Parser Script

Creating a new parser requires only a command-line script that outputs valid JSON to STDOUT. The following Python example demonstrates a minimal implementation for a careers endpoint that returns a JSON array of postings, each with `title`, `apply_url`, and `location` fields:

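```python
import argparse
import json
import sys

import requests

parser = argparse.ArgumentParser()
parser.add_argument("--url", required=True)
args = parser.parse_args()

# Fail fast: the request timeout stays below the provider's timeout_ms,
# and HTTP errors raise so the failure is visible on STDERR.
resp = requests.get(args.url, timeout=15)
resp.raise_for_status()
data = resp.json()

# Map each posting to the fields the provider normalizes.
jobs = [
    {"title": j["title"], "url": j["apply_url"], "company": "ExampleCo", "location": j.get("location", "")}
    for j in data
]

# STDOUT must carry only the JSON payload.
json.dump({"jobs": jobs}, sys.stdout)
```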

The script receives expanded arguments from `buildParserArgs`, queries the external API, and prints a JSON object to STDOUT. The `local-parser.mjs` provider handles process management, timeout enforcement, and schema validation automatically.

## Integration with the Career-Ops Scanner

The `scan.mjs` orchestrator iterates through all configured providers, invoking `detect()` on each until one returns a non-null value. Because `local-parser.mjs` runs a lightweight script and receives structured data directly, it avoids the browser startup and page-rendering cost of Playwright-based scraping, although the script itself may still make network requests. Adding a new job source therefore comes down to writing a parser script and adding configuration: no modifications to `scan.mjs` or the core provider logic are necessary.
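First-match selection of this kind can be sketched as a simple loop. The helper `selectProvider` and the two stub providers are hypothetical and stand in for whatever `scan.mjs` actually registers.

```javascript
// Sketch of first-match provider selection; all names here are hypothetical.
function selectProvider(providers, entry) {
  for (const provider of providers) {
    const match = provider.detect(entry);
    if (match != null) return { provider, match }; // first non-null result wins
  }
  return null; // no provider claimed this entry
}

// Two stub providers: one script-backed, one catch-all.
const localStub = {
  id: 'local-parser',
  detect: (entry) => (entry.parser?.command ? entry.careers_url || 'local-parser' : null),
};
const browserStub = {
  id: 'browser',
  detect: (entry) => entry.careers_url ?? null,
};

const chosen = selectProvider([localStub, browserStub], {
  careers_url: 'https://example.com/careers',
  parser: { command: 'python3' },
});
console.log(chosen.provider.id);
```

With this ordering, an entry that has a `parser` block is claimed by the script-backed provider before any browser-based fallback is considered.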

## Summary

- The **Provider contract** in [`providers/_types.js`](https://github.com/santifer/career-ops/blob/main/providers/_types.js) requires implementing `detect()` and `fetch()` methods for all scraping sources.
- **Configuration-driven binding** via [`portals.yml`](https://github.com/santifer/career-ops/blob/main/portals.yml) uses placeholder expansion through `expandParserArg` to pass context-aware arguments to external scripts.
- **Detection logic** in `detect()` (lines 10-16) verifies that `parser.command` is configured and that the referenced script file exists before marking a provider as eligible.
- **Process execution** through `runLocalParser()` uses `execFileAsync` (lines 82-86) with configurable timeouts and output-buffer limits to prevent runaway scripts from stalling the scan.
- **Schema normalization** via `normalizeParserJob` (lines 58-74) enforces a uniform output format across all parser languages and implementations.

## Frequently Asked Questions

### What programming languages are supported for custom parsers in local-parser.mjs?

Any language capable of reading command-line arguments and printing JSON to STDOUT is supported. The `command` field in [`portals.yml`](https://github.com/santifer/career-ops/blob/main/portals.yml) names the executable to run (e.g., `python3`, `node`, `ruby`, `bash`), so the architecture is language-agnostic.

### How does local-parser.mjs handle script failures and resource limits?

The provider passes `timeout_ms` and `max_buffer_bytes` configuration values directly to `execFileAsync` (lines 82-86). If a script exceeds the time limit or output buffer, the promise rejects and the provider returns an empty result set, allowing `scan.mjs` to continue with other providers without freezing the pipeline.

### What exact JSON structure must my parser output to be compatible with local-parser.mjs?

Your script must print a JSON object to STDOUT containing either a `jobs` or `results` array. Each array element must include `title` and `url` fields; `company` and `location` are optional. The `normalizeParserJob` function (lines 58-74) validates these fields and filters out malformed entries before returning data to the scanner.

### Do I need to modify core Career-Ops files to add a new job site parser?

No. Adding support for a new job site requires only creating the parser script and adding a corresponding entry in [`portals.yml`](https://github.com/santifer/career-ops/blob/main/portals.yml). The provider selection logic in `scan.mjs` picks a provider for each entry through the `detect()` method interface, so new sites can be added without changing core code.