How WorldMonitor's Baseline Algorithm Detects Trending Keywords While Filtering Statistical Noise

March 9, 2026 koala73/worldmonitor

WorldMonitor detects emerging trending keywords by comparing short-term activity against a 7-day moving-average baseline, then applies layered filters (source-diversity checks, token-length guards, and suppressed-term lists) to eliminate statistical noise.

WorldMonitor uses a deterministic baseline-over-spike approach to surface breaking topics from high-velocity news streams. The algorithm, implemented in TypeScript, maintains per-term time series to distinguish genuine emerging trends from random chatter. This article examines the core mechanics of how the baseline algorithm detects trending keywords while filtering statistical noise, referencing the actual source implementation in the koala73/worldmonitor repository.

Core Architecture of the Baseline Detection Engine

The detection engine lives in src/services/trending-keywords.ts. It operates by maintaining a rolling time series for every candidate term, then executing a statistical comparison between recent velocity and historical norms.

Ingestion and Time-Series Maintenance

The process begins with ingestHeadlines(), which processes raw news items and populates the internal state:

Tokenization: The title is parsed using tokenize() from src/utils/analysis-constants.ts, which handles regex-based entity extraction and stop-word removal.
Timestamp recording: Every candidate term receives an entry in the global termFrequency map via recordTermCandidates, tagged with its publication timestamp and source attribution.
State pruning: pruneOldState(now) continuously removes timestamps older than the 7-day baseline window and enforces the MAX_TRACKED_TERMS limit (10,000 terms) to prevent memory bloat.
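
The pruning step can be illustrated with a minimal sketch. The record shape, the function name pruneOldStateSketch, and the eviction policy (drop the terms whose latest mention is oldest) are assumptions made for illustration; the article does not show how the real pruneOldState() chooses which terms to evict.

```typescript
// Hypothetical record shape; the real records in trending-keywords.ts may differ.
interface TermRecord {
  timestamps: number[]; // publication times in epoch milliseconds
}

const BASELINE_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7-day window from the article
const MAX_TRACKED_TERMS = 10_000; // limit quoted in the article

// Most recent timestamp in a record, or -Infinity if it is empty.
function latest(record: TermRecord): number {
  return record.timestamps.reduce((max, t) => (t > max ? t : max), -Infinity);
}

function pruneOldStateSketch(
  termFrequency: Map<string, TermRecord>,
  now: number,
  maxTerms: number = MAX_TRACKED_TERMS,
): void {
  const cutoff = now - BASELINE_WINDOW_MS;

  // Step 1: drop timestamps outside the window, and terms left with none.
  for (const [term, record] of Array.from(termFrequency)) {
    record.timestamps = record.timestamps.filter((t) => t >= cutoff);
    if (record.timestamps.length === 0) termFrequency.delete(term);
  }

  // Step 2: enforce the cap. Assumed policy: evict the least recently seen terms.
  const excess = termFrequency.size - maxTerms;
  if (excess <= 0) return;
  const byRecency = Array.from(termFrequency).sort(
    (a, b) => latest(a[1]) - latest(b[1]),
  );
  for (const [term] of byRecency.slice(0, excess)) {
    termFrequency.delete(term);
  }
}
```

Filtering on every call keeps each per-term array bounded by the window, so memory grows with recent activity rather than with uptime.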

Baseline Calculation Strategy

Every hour (BASELINE_REFRESH_MS), the system recomputes historical expectations:

// Inside maybeRefreshBaselines(now)
record.baseline7d = weekCount / 7;

This yields a 7-day moving-average baseline for each term, representing its typical daily volume. The baseline serves as the statistical expectation against which recent surges are measured.
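
As a worked illustration of that division, the sketch below counts a term's mentions over the trailing seven days and averages them per day. The helper name baseline7dSketch and the sample data are hypothetical; only the weekCount / 7 step comes from the excerpt above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: counts mentions in the trailing 7 days and averages per day.
function baseline7dSketch(timestamps: number[], now: number): number {
  const weekStart = now - 7 * DAY_MS;
  const weekCount = timestamps.filter((t) => t >= weekStart && t <= now).length;
  return weekCount / 7; // same division as the excerpt above
}

const sampleNow = Date.UTC(2026, 2, 9);
// 21 sample mentions spaced eight hours apart, i.e. three per day for a week.
const sampleMentions = Array.from(
  { length: 21 },
  (_, i) => sampleNow - (i + 1) * (DAY_MS / 3),
);
const sampleBaseline = baseline7dSketch(sampleMentions, sampleNow);
// sampleBaseline is 3: 21 mentions divided by 7 days.
```

With a baseline of 3 and the default 3× multiplier described in the next section, this term would need 9 mentions inside the rolling window to qualify as a spike.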

Spike Detection Logic

The checkForSpikes(now, config, blockedTerms) function executes the core comparison:

Recent count aggregation: Counts occurrences within the 2-hour rolling window (ROLLING_WINDOW_MS).
Spike threshold test: A term triggers when either:
- recentCount ≥ baseline × spikeMultiplier (default multiplier: 3×), or
- recentCount ≥ minSpikeCount (for terms lacking sufficient historical data).
Noise-reduction guards:
- Source diversity: Requires uniqueSources ≥ MIN_SPIKE_SOURCE_COUNT (minimum 2 distinct news sources).
- Cooldown period: SPIKE_COOLDOWN_MS prevents repeated alerts for the same term.
- Token length filter: Ignores tokens shorter than MIN_TOKEN_LENGTH unless flagged as entities.

When validated, handleSpike() generates a keyword_spike CorrelationSignal containing the term, confidence score, and contextual metadata.
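
The threshold test and two of the guards above can be sketched as a single predicate. This is not the repository's checkForSpikes(): the function name, the Mention shape, the 30-minute cooldown value, and the choice to treat "lacking sufficient historical data" as a zero baseline are all assumptions for illustration.

```typescript
interface SpikeConfig {
  spikeMultiplier: number;
  minSpikeCount: number;
}

// Hypothetical shape for one recorded occurrence of a term.
interface Mention {
  timestamp: number; // epoch milliseconds
  source: string;
}

const ROLLING_WINDOW_MS = 2 * 60 * 60 * 1000; // 2-hour window from the article
const MIN_SPIKE_SOURCE_COUNT = 2; // from the article
const SPIKE_COOLDOWN_MS = 30 * 60 * 1000; // assumed value; not stated in the article

function isSpikeSketch(
  mentions: Mention[],
  baseline7d: number,
  lastSpikeAt: number | undefined,
  now: number,
  config: SpikeConfig,
): boolean {
  // Cooldown guard: stay quiet if this term alerted recently.
  if (lastSpikeAt !== undefined && now - lastSpikeAt < SPIKE_COOLDOWN_MS) {
    return false;
  }

  const windowStart = now - ROLLING_WINDOW_MS;
  const recent = mentions.filter(
    (m) => m.timestamp >= windowStart && m.timestamp <= now,
  );

  // Source-diversity guard: one outlet repeating a term is not a trend.
  const uniqueSources = new Set(recent.map((m) => m.source)).size;
  if (uniqueSources < MIN_SPIKE_SOURCE_COUNT) return false;

  // Threshold test. Assumption: a zero baseline means "no usable history",
  // in which case the absolute minimum count applies instead.
  const recentCount = recent.length;
  return baseline7d > 0
    ? recentCount >= baseline7d * config.spikeMultiplier
    : recentCount >= config.minSpikeCount;
}
```

Checking the cheap guards before counting keeps the common case (no spike) inexpensive when many terms are evaluated on every cycle.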

Multi-Layered Noise Filtering Mechanisms

Statistical noise reduction operates through several defensive layers that prevent common verbs, stop-words, and transient bursts from triggering false alerts.

Static and Dynamic Suppression Lists

Hard-coded filters: The SUPPRESSED_TRENDING_TERMS array in src/utils/analysis-constants.ts excludes generic nouns, common verbs, and filler words.
User-defined blocks: The suppressTrendingTerm() API persists blocked terms via localStorage, allowing operators to manually quiet noisy keywords at runtime.
Entity-only bypass: Terms flagged with isEntity: true bypass the minimum token-length filter, ensuring that short named entities (e.g., "ISIS", "WHO") are not discarded.
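
The three rules above combine into a simple per-candidate filter. In this sketch the Candidate shape, the MIN_TOKEN_LENGTH value of 4, and the sample suppressed terms are illustrative assumptions; the actual contents of SUPPRESSED_TRENDING_TERMS are not reproduced in this article.

```typescript
// Hypothetical candidate shape produced by tokenization.
interface Candidate {
  term: string;
  isEntity: boolean;
}

const MIN_TOKEN_LENGTH = 4; // assumed value for illustration
// Illustrative sample only; not the real SUPPRESSED_TRENDING_TERMS list.
const SUPPRESSED_SAMPLE = new Set(["breaking", "says", "report"]);

function keepCandidate(c: Candidate, userBlocked: Set<string>): boolean {
  const term = c.term.toLowerCase();
  // Static and user-defined suppression apply to every term, entity or not.
  if (SUPPRESSED_SAMPLE.has(term) || userBlocked.has(term)) return false;
  // Entities bypass the length guard; other tokens must be long enough.
  return c.isEntity || term.length >= MIN_TOKEN_LENGTH;
}

const userBlocked = new Set(["live"]);
const kept = [
  { term: "WHO", isEntity: true },
  { term: "the", isEntity: false },
  { term: "Breaking", isEntity: false },
  { term: "Damascus", isEntity: true },
  { term: "live", isEntity: false },
].filter((c) => keepCandidate(c, userBlocked));
// kept contains only "WHO" and "Damascus".
```

Whether suppression takes precedence over the entity flag in the real implementation is not stated here; the sketch assumes it does, so that an operator can always silence a term.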

ML-Based Entity Enrichment

When enabled, the optional ML NER worker in src/services/ml-worker.ts enriches the term list with high-confidence entities that might otherwise be missed by regex extraction alone. This layer catches emerging proper nouns before they accumulate enough frequency to bypass standard tokenization filters.

Configuring Detection Sensitivity

The algorithm exposes runtime tunables via updateTrendingConfig(), allowing sensitivity adjustments without redeployment.

Adjust Spike Thresholds

Lower the multiplier to catch subtle trends, or raise it to surface only major breaking news:

import { updateTrendingConfig } from '@/services/trending-keywords';

// More sensitive detection
updateTrendingConfig({ spikeMultiplier: 2, minSpikeCount: 3 });
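
For readers curious how a partial update like this might be merged and persisted, here is a minimal sketch. It is not the repository's code: the storage key, the KeyValueStore interface (a stand-in for localStorage so the example also runs outside a browser), and the function names are hypothetical. The default values mirror the sample getTrendingConfig() output shown later in this article.

```typescript
interface TrendingConfig {
  spikeMultiplier: number;
  minSpikeCount: number;
}

// Minimal subset of the Web Storage API, so localStorage or a test double fits.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const DEFAULT_CONFIG: TrendingConfig = { spikeMultiplier: 3, minSpikeCount: 5 };
const STORAGE_KEY = "trending-config-sketch"; // hypothetical key

function loadConfigSketch(store: KeyValueStore): TrendingConfig {
  const raw = store.getItem(STORAGE_KEY);
  if (raw === null) return { ...DEFAULT_CONFIG };
  try {
    // Stored values override defaults; missing fields fall back to defaults.
    return { ...DEFAULT_CONFIG, ...(JSON.parse(raw) as Partial<TrendingConfig>) };
  } catch {
    return { ...DEFAULT_CONFIG }; // corrupt entry: ignore it
  }
}

function updateConfigSketch(
  store: KeyValueStore,
  patch: Partial<TrendingConfig>,
): TrendingConfig {
  const next = { ...loadConfigSketch(store), ...patch };
  store.setItem(STORAGE_KEY, JSON.stringify(next));
  return next;
}
```

Accepting Partial<TrendingConfig> lets callers change one field without restating the others, which matches the call shape used above.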

Retrieve Generated Signals

Consume detected spikes through the drain API:

import { drainTrendingSignals } from '@/services/trending-keywords';

const signals = drainTrendingSignals(); // Returns CorrelationSignal[]
signals.forEach(sig => {
  console.log(`Trending: ${sig.title} (confidence: ${sig.confidence})`);
});

Note: drainTrendingSignals() clears the internal queue, making it safe for polling integrations.
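
The drain-and-clear behaviour described in the note can be pictured with a small generic queue. The SignalQueue class below is a hypothetical illustration of the pattern, not the internal structure used by drainTrendingSignals().

```typescript
// Hypothetical illustration of a drain-and-clear queue.
class SignalQueue<T> {
  private pending: T[] = [];

  push(signal: T): void {
    this.pending.push(signal);
  }

  // Hands back everything queued so far and starts a fresh queue,
  // so consecutive polls never see the same signal twice.
  drain(): T[] {
    const out = this.pending;
    this.pending = [];
    return out;
  }
}

const queue = new SignalQueue<string>();
queue.push("damascus");
queue.push("hts");
const firstPoll = queue.drain(); // both queued signals
const secondPoll = queue.drain(); // empty until new signals arrive
```

Swapping in a new array rather than clearing the old one means the caller owns the returned array outright and later pushes cannot mutate it.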

Manually Suppress Noisy Terms

import { suppressTrendingTerm } from '@/services/trending-keywords';

// Block a term causing false positives
suppressTrendingTerm('breaking');

Practical Implementation Examples

Ingesting Headlines from RSS Feeds

Process incoming articles to update the time-series state:

import { ingestHeadlines } from '@/services/trending-keywords';

const headlines = [
  {
    title: 'Assad forces clash with HTS near Damascus',
    pubDate: new Date(),
    source: 'Reuters',
    link: 'https://reuters.com/…',
  },
];

ingestHeadlines(headlines);

This call immediately updates termFrequency and triggers the spike detection loop.

Current Configuration Inspection

import { getTrendingConfig } from '@/services/trending-keywords';

console.log('Active config:', getTrendingConfig());
// Output: { spikeMultiplier: 3, minSpikeCount: 5, ... }

Summary

Baseline-over-spike detection: The algorithm compares 2-hour rolling counts against 7-day moving averages to identify statistically significant surges.
Multi-source validation: Requires at least 2 distinct news sources to confirm a trend, filtering out single-source anomalies.
Configurable thresholds: spikeMultiplier and minSpikeCount allow runtime tuning of detection sensitivity via updateTrendingConfig().
Aggressive noise filtering: Combines static suppressed-term lists, token-length guards, cooldown timers, and optional ML NER to eliminate statistical noise.
Memory-bounded operation: MAX_TRACKED_TERMS (10,000) and automatic pruning of stale data ensure consistent performance in long-running browser or Node.js environments.

Frequently Asked Questions

How does the baseline algorithm differentiate between a genuine trend and a random spike?

The algorithm requires the 2-hour recent count to meet or exceed the 7-day moving-average baseline multiplied by a configurable factor (default 3×) or, for terms lacking sufficient history, a minimum absolute count. It also enforces source diversity (at least 2 distinct sources), which filters out single-source stories, and a cooldown period, which stops the same term from generating repeated alerts.

Which file contains the suppressed terms list that filters statistical noise?

The SUPPRESSED_TRENDING_TERMS array is defined in src/utils/analysis-constants.ts. This list contains common verbs, stop-words, and generic nouns that are statistically likely to spike randomly but carry no topical significance.

Can detection sensitivity be adjusted without restarting the application?

Yes. The updateTrendingConfig() function in src/services/trending-keywords.ts accepts partial configuration objects (e.g., { spikeMultiplier: 2 }) and persists changes to localStorage. These adjustments take effect immediately on the next spike detection cycle without requiring a service restart.

How does WorldMonitor prevent short tokens or acronyms from being filtered as noise?

While the algorithm ignores tokens shorter than MIN_TOKEN_LENGTH by default, it preserves terms flagged as entities (isEntity: true). Furthermore, the optional ML NER worker (src/services/ml-worker.ts) enriches candidate lists with high-confidence named entities, ensuring that short but significant terms (e.g., "WHO", "EU") are retained for baseline comparison.
