# How WorldMonitor's Baseline Algorithm Detects Trending Keywords While Filtering Statistical Noise

> WorldMonitor's baseline algorithm detects trending keywords by comparing activity to a 7-day average and filtering out statistical noise using multi-layered checks.

- Repository: [Elie Habib/worldmonitor](https://github.com/koala73/worldmonitor)
- Tags: internals
- Published: 2026-03-09

---

**WorldMonitor detects emerging trending keywords by comparing short-term activity against a 7-day moving-average baseline, then applies multi-layered filters (source diversity checks, token-length guards, and suppressed-term lists) to eliminate statistical noise.**

WorldMonitor uses a deterministic baseline-over-spike approach to surface breaking topics from high-velocity news streams. The algorithm, implemented in TypeScript, maintains a per-term time series to separate genuine emerging trends from random chatter. This article walks through its core mechanics, referencing the source implementation in the `koala73/worldmonitor` repository.

## Core Architecture of the Baseline Detection Engine

The detection engine lives in **[`src/services/trending-keywords.ts`](https://github.com/koala73/worldmonitor/blob/main/src/services/trending-keywords.ts)**. It operates by maintaining a rolling time series for every candidate term, then executing a statistical comparison between recent velocity and historical norms.

### Ingestion and Time-Series Maintenance

The process begins with **`ingestHeadlines()`**, which processes raw news items and populates the internal state:

- **Tokenization**: The title is parsed using **`tokenize()`** from **[`src/utils/analysis-constants.ts`](https://github.com/koala73/worldmonitor/blob/main/src/utils/analysis-constants.ts)**, which handles regex-based entity extraction and stop-word removal.
- **Timestamp recording**: Every candidate term receives an entry in the global `termFrequency` map via **`recordTermCandidates`**, tagged with its publication timestamp and source attribution.
- **State pruning**: **`pruneOldState(now)`** continuously removes timestamps older than the **7-day baseline window** and enforces the **`MAX_TRACKED_TERMS`** limit (10,000 terms) to prevent memory bloat.
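
The ingestion and pruning steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `TermRecord` shape, the `recordTerm` helper, and the "evict least recently seen first" policy are assumptions made for the example. Only the 7-day window and the 10,000-term limit come from the description above.

```typescript
// Hypothetical record shape; the real layout in trending-keywords.ts may differ.
interface TermRecord {
  timestamps: number[]; // publication times, ms since epoch
  lastSeen: number;     // most recent timestamp recorded for the term
}

const DAY_MS = 24 * 60 * 60 * 1000;
const BASELINE_WINDOW_MS = 7 * DAY_MS; // 7-day baseline window
const MAX_TRACKED_TERMS = 10_000;      // limit quoted in the article

// Append one occurrence of a term to its time series.
function recordTerm(state: Map<string, TermRecord>, term: string, ts: number): void {
  const rec = state.get(term) ?? { timestamps: [], lastSeen: 0 };
  rec.timestamps.push(ts);
  rec.lastSeen = Math.max(rec.lastSeen, ts);
  state.set(term, rec);
}

// Drop timestamps outside the window, then cap the number of tracked terms.
function pruneOldState(
  state: Map<string, TermRecord>,
  now: number,
  maxTerms: number = MAX_TRACKED_TERMS,
): void {
  const cutoff = now - BASELINE_WINDOW_MS;
  for (const [term, rec] of Array.from(state.entries())) {
    rec.timestamps = rec.timestamps.filter((ts) => ts >= cutoff);
    if (rec.timestamps.length === 0) state.delete(term);
  }
  if (state.size > maxTerms) {
    // Assumption: evict the least recently seen terms first.
    const oldestFirst = Array.from(state.entries()).sort(
      (a, b) => a[1].lastSeen - b[1].lastSeen,
    );
    for (const [term] of oldestFirst.slice(0, state.size - maxTerms)) {
      state.delete(term);
    }
  }
}
```

Pruning on every ingest keeps each term's array bounded by its 7-day volume, and the term cap bounds the map itself.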

### Baseline Calculation Strategy

Every hour (**`BASELINE_REFRESH_MS`**), the system recomputes historical expectations:

```typescript
// Inside maybeRefreshBaselines(now)
// weekCount: occurrences of the term within the 7-day baseline window
record.baseline7d = weekCount / 7; // average mentions per day
```

This yields a **7-day moving-average baseline** for each term, representing its typical daily volume. The baseline serves as the statistical expectation against which recent surges are measured.
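
To make the arithmetic concrete, here is a small sketch of the refresh step. The `BaselineRecord` shape and the helper names `refreshBaseline` and `isRefreshDue` are hypothetical. Only the `weekCount / 7` formula and the hourly refresh interval are taken from the text above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const BASELINE_REFRESH_MS = 60 * 60 * 1000; // hourly refresh, per the article

// Hypothetical record shape used only for this illustration.
interface BaselineRecord {
  timestamps: number[]; // ms since epoch
  baseline7d: number;   // average mentions per day over the last 7 days
}

// Recompute the 7-day moving-average baseline for one term.
function refreshBaseline(record: BaselineRecord, now: number): void {
  const weekStart = now - 7 * DAY_MS;
  const weekCount = record.timestamps.filter(
    (ts) => ts >= weekStart && ts <= now,
  ).length;
  record.baseline7d = weekCount / 7;
}

// True once at least one refresh interval has elapsed since the last refresh.
function isRefreshDue(lastRefreshAt: number, now: number): boolean {
  return now - lastRefreshAt >= BASELINE_REFRESH_MS;
}
```

For example, a term seen 21 times in the past week gets a baseline of 3 mentions per day. With the default 3× multiplier, the threshold works out to 9 mentions.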

### Spike Detection Logic

The **`checkForSpikes(now, config, blockedTerms)`** function executes the core comparison:

1. **Recent count aggregation**: Counts occurrences within the **2-hour rolling window** (**`ROLLING_WINDOW_MS`**).
2. **Spike threshold test**: A term triggers when either:
   - `recentCount ≥ baseline * spikeMultiplier` (default multiplier: **3×**), or
   - `recentCount ≥ minSpikeCount` (for terms lacking sufficient historical data).
3. **Noise-reduction guards**:
   - **Source diversity**: Requires `uniqueSources` ≥ **`MIN_SPIKE_SOURCE_COUNT`** (minimum 2 distinct news sources).
   - **Cooldown period**: **`SPIKE_COOLDOWN_MS`** prevents repeated alerts for the same term.
   - **Token length filter**: Ignores tokens shorter than **`MIN_TOKEN_LENGTH`** unless flagged as entities.

When validated, **`handleSpike()`** generates a **`keyword_spike`** `CorrelationSignal` containing the term, confidence score, and contextual metadata.
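
The checks listed above can be combined into a single predicate, sketched below. This is one reading of the description rather than the actual `checkForSpikes` body. It assumes the `minSpikeCount` test applies only when a term has no baseline yet. The `Mention` type, the `isSpike` name, and the 30-minute cooldown value are illustrative assumptions.

```typescript
interface Mention {
  timestamp: number; // ms since epoch
  source: string;    // e.g. the feed or publisher name
}

interface SpikeConfig {
  spikeMultiplier: number;
  minSpikeCount: number;
}

const ROLLING_WINDOW_MS = 2 * 60 * 60 * 1000; // 2-hour rolling window
const MIN_SPIKE_SOURCE_COUNT = 2;             // at least 2 distinct sources
const SPIKE_COOLDOWN_MS = 30 * 60 * 1000;     // assumed value, not stated in the article

function isSpike(
  mentions: Mention[],
  baseline7d: number,
  lastSpikeAt: number | null,
  now: number,
  config: SpikeConfig,
): boolean {
  const windowStart = now - ROLLING_WINDOW_MS;
  const recent = mentions.filter(
    (m) => m.timestamp >= windowStart && m.timestamp <= now,
  );
  const recentCount = recent.length;

  // Assumption: use the multiplier test when a baseline exists,
  // otherwise fall back to the absolute minimum count.
  const exceedsThreshold =
    baseline7d > 0
      ? recentCount >= baseline7d * config.spikeMultiplier
      : recentCount >= config.minSpikeCount;
  if (!exceedsThreshold) return false;

  // Source diversity guard: reject single-source bursts.
  const uniqueSources = new Set(recent.map((m) => m.source)).size;
  if (uniqueSources < MIN_SPIKE_SOURCE_COUNT) return false;

  // Cooldown guard: suppress repeat alerts for the same term.
  if (lastSpikeAt !== null && now - lastSpikeAt < SPIKE_COOLDOWN_MS) return false;

  return true;
}
```

Running the cheap count comparison first means the source set is only built for terms that already pass the threshold.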

## Multi-Layered Noise Filtering Mechanisms

Statistical noise reduction operates through several defensive layers that prevent common verbs, stop-words, and transient bursts from triggering false alerts.

### Static and Dynamic Suppression Lists

- **Hard-coded filters**: The **`SUPPRESSED_TRENDING_TERMS`** array in **[`src/utils/analysis-constants.ts`](https://github.com/koala73/worldmonitor/blob/main/src/utils/analysis-constants.ts)** excludes generic nouns, common verbs, and filler words.
- **User-defined blocks**: The **`suppressTrendingTerm()`** API persists blocked terms via `localStorage`, allowing operators to manually quiet noisy keywords at runtime.
- **Entity-only bypass**: Terms flagged with `isEntity: true` bypass the minimum token-length filter, ensuring that short named entities (e.g., "ISIS", "WHO") are not discarded.
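
A rough sketch of how these three rules might combine into one eligibility check. The `Candidate` type, the `isEligible` function, the sample suppressed terms, and the `MIN_TOKEN_LENGTH` value of 3 are all assumptions for illustration. The article does not list the actual entries or the actual minimum length.

```typescript
// Hypothetical candidate shape; `isEntity` mirrors the flag described above.
interface Candidate {
  term: string;
  isEntity: boolean;
}

const MIN_TOKEN_LENGTH = 3; // assumed value
// Illustrative entries only; the real SUPPRESSED_TRENDING_TERMS list is not reproduced here.
const SUPPRESSED_TERMS: ReadonlySet<string> = new Set(["said", "says", "report"]);

function isEligible(candidate: Candidate, userBlocked: ReadonlySet<string>): boolean {
  const key = candidate.term.toLowerCase();
  // Static and user-defined suppression apply to every term.
  if (SUPPRESSED_TERMS.has(key) || userBlocked.has(key)) return false;
  // Short tokens are dropped unless they were flagged as entities.
  if (!candidate.isEntity && key.length < MIN_TOKEN_LENGTH) return false;
  return true;
}
```

In this sketch the entity flag only bypasses the length guard, so a short entity can still be silenced through either suppression list.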

### ML-Based Entity Enrichment

When enabled, the optional ML NER worker in **[`src/services/ml-worker.ts`](https://github.com/koala73/worldmonitor/blob/main/src/services/ml-worker.ts)** enriches the term list with high-confidence entities that regex extraction alone might miss. Because these terms enter the candidate list as entities, emerging proper nouns can be tracked even when the standard tokenization filters would have dropped them.

## Configuring Detection Sensitivity

The algorithm exposes runtime tunables via **`updateTrendingConfig()`**, allowing sensitivity adjustments without redeployment.

### Adjust Spike Thresholds

Lower the multiplier to catch subtle trends, or raise it to surface only major breaking news:

```typescript
import { updateTrendingConfig } from '@/services/trending-keywords';

// More sensitive detection
updateTrendingConfig({ spikeMultiplier: 2, minSpikeCount: 3 });
```
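
For readers curious how a partial update with persistence might work, here is a minimal sketch. It is not the repository's implementation. The `KeyValueStore` interface, the `MemoryStore` stand-in for `localStorage`, the storage key, and the function names are hypothetical. The defaults mirror the values quoted elsewhere in this article.

```typescript
interface TrendingConfig {
  spikeMultiplier: number;
  minSpikeCount: number;
}

// Minimal subset of the Web Storage interface, so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// In-memory stand-in; in a browser, window.localStorage satisfies KeyValueStore.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}

const STORAGE_KEY = "trending-config"; // hypothetical key name
const DEFAULTS: TrendingConfig = { spikeMultiplier: 3, minSpikeCount: 5 };

// Read the stored config, falling back to defaults on missing or corrupt data.
function loadConfig(store: KeyValueStore): TrendingConfig {
  const raw = store.getItem(STORAGE_KEY);
  if (!raw) return { ...DEFAULTS };
  try {
    return { ...DEFAULTS, ...(JSON.parse(raw) as Partial<TrendingConfig>) };
  } catch {
    return { ...DEFAULTS };
  }
}

// Merge a partial patch over the current config and persist the result.
function updateConfig(store: KeyValueStore, patch: Partial<TrendingConfig>): TrendingConfig {
  const next = { ...loadConfig(store), ...patch };
  store.setItem(STORAGE_KEY, JSON.stringify(next));
  return next;
}
```

Merging over defaults on every read means that a field added in a later version still gets a sensible value when older stored data lacks it.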

### Retrieve Generated Signals

Consume detected spikes through the drain API:

```typescript
import { drainTrendingSignals } from '@/services/trending-keywords';

const signals = drainTrendingSignals(); // Returns CorrelationSignal[]
signals.forEach(sig => {
  console.log(`Trending: ${sig.title} (confidence: ${sig.confidence})`);
});
```

*Note: `drainTrendingSignals()` clears the internal queue, so each signal is returned only once. This makes it safe for polling integrations, which will not see the same spike twice.*
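
The drain semantics described in the note can be illustrated with a small queue sketch. The `Signal` shape and the `enqueueSignal` and `drainSignals` names are hypothetical stand-ins, not the module's internals.

```typescript
// Simplified stand-in for CorrelationSignal, limited to the fields used in this article.
interface Signal {
  title: string;
  confidence: number;
}

const pendingSignals: Signal[] = [];

function enqueueSignal(signal: Signal): void {
  pendingSignals.push(signal);
}

// Return everything queued so far and leave the queue empty.
function drainSignals(): Signal[] {
  return pendingSignals.splice(0, pendingSignals.length);
}
```

Because `splice` removes and returns the elements in one step, a second call made before any new spike arrives yields an empty array.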

### Manually Suppress Noisy Terms

```typescript
import { suppressTrendingTerm } from '@/services/trending-keywords';

// Block a term causing false positives
suppressTrendingTerm('breaking');
```

## Practical Implementation Examples

### Ingesting Headlines from RSS Feeds

Process incoming articles to update the time-series state:

```typescript
import { ingestHeadlines } from '@/services/trending-keywords';

const headlines = [
  {
    title: 'Assad forces clash with HTS near Damascus',
    pubDate: new Date(),
    source: 'Reuters',
    link: 'https://reuters.com/…',
  },
];

ingestHeadlines(headlines);
```

*This call immediately updates `termFrequency` and triggers the spike detection loop.*

### Current Configuration Inspection

```typescript
import { getTrendingConfig } from '@/services/trending-keywords';

console.log('Active config:', getTrendingConfig());
// Output: { spikeMultiplier: 3, minSpikeCount: 5, ... }
```

## Summary

- **Baseline-over-spike detection**: The algorithm compares 2-hour rolling counts against 7-day moving averages to identify surges well above a term's usual volume.
- **Multi-source validation**: Requires at least 2 distinct news sources to confirm a trend, filtering out single-source anomalies.
- **Configurable thresholds**: `spikeMultiplier` and `minSpikeCount` allow runtime tuning of detection sensitivity via `updateTrendingConfig()`.
- **Aggressive noise filtering**: Combines static suppressed-term lists, token-length guards, cooldown timers, and optional ML NER to eliminate statistical noise.
- **Memory-bounded operation**: `MAX_TRACKED_TERMS` (10,000) and automatic pruning of stale data ensure consistent performance in long-running browser or Node.js environments.

## Frequently Asked Questions

### How does the baseline algorithm differentiate between a genuine trend and a random spike?

The algorithm requires the **2-hour recent count** to reach the **7-day moving-average baseline** multiplied by a configurable factor (default **3×**), or to reach a minimum absolute threshold when the term lacks sufficient history. Two further guards apply: **source diversity** (minimum 2 distinct sources) filters out single-source stories, and a **cooldown period** stops burst traffic from generating repeated alerts for the same term.

### Which file contains the suppressed terms list that filters statistical noise?

The **`SUPPRESSED_TRENDING_TERMS`** array is defined in **[`src/utils/analysis-constants.ts`](https://github.com/koala73/worldmonitor/blob/main/src/utils/analysis-constants.ts)**. This list contains common verbs, stop-words, and generic nouns that are statistically likely to spike randomly but carry no topical significance.

### Can detection sensitivity be adjusted without restarting the application?

Yes. The **`updateTrendingConfig()`** function in **[`src/services/trending-keywords.ts`](https://github.com/koala73/worldmonitor/blob/main/src/services/trending-keywords.ts)** accepts partial configuration objects (e.g., `{ spikeMultiplier: 2 }`) and persists changes to `localStorage`. These adjustments take effect immediately on the next spike detection cycle without requiring a service restart.

### How does WorldMonitor prevent short tokens or acronyms from being filtered as noise?

While the algorithm ignores tokens shorter than **`MIN_TOKEN_LENGTH`** by default, it preserves terms flagged as **entities** (`isEntity: true`). Furthermore, the optional **ML NER worker** ([`src/services/ml-worker.ts`](https://github.com/koala73/worldmonitor/blob/main/src/services/ml-worker.ts)) enriches candidate lists with high-confidence named entities, ensuring that short but significant terms (e.g., "WHO", "EU") are retained for baseline comparison.