How WorldMonitor's Baseline Algorithm Detects Trending Keywords While Filtering Statistical Noise
WorldMonitor detects emerging trending keywords by comparing short-term activity against a 7-day moving-average baseline, then applies layered filters (source diversity checks, token-length guards, and suppressed-term lists) to reduce statistical noise.
WorldMonitor uses a deterministic baseline-over-spike approach to surface breaking topics from high-velocity news streams. The algorithm, implemented in TypeScript, maintains per-term time series to distinguish genuine emerging trends from random chatter. This article examines the core mechanics of how the baseline algorithm detects trending keywords while filtering statistical noise, referencing the actual source implementation in the koala73/worldmonitor repository.
Core Architecture of the Baseline Detection Engine
The detection engine lives in src/services/trending-keywords.ts. It operates by maintaining a rolling time series for every candidate term, then executing a statistical comparison between recent velocity and historical norms.
Ingestion and Time-Series Maintenance
The process begins with ingestHeadlines(), which processes raw news items and populates the internal state:
- Tokenization: The title is parsed using tokenize() from src/utils/analysis-constants.ts, which handles regex-based entity extraction and stop-word removal.
- Timestamp recording: Every candidate term receives an entry in the global termFrequency map via recordTermCandidates, tagged with its publication timestamp and source attribution.
- State pruning: pruneOldState(now) continuously removes timestamps older than the 7-day baseline window and enforces the MAX_TRACKED_TERMS limit (10,000 terms) to prevent memory bloat.
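The recording and pruning steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the TermEntry shape, the recordTerm helper, and the eviction policy used when the term cap is exceeded are all assumptions made for the example.

```typescript
// Hypothetical entry shape; the real record type in the repository may differ.
interface TermEntry {
  timestamp: number; // publication time in ms since epoch
  source: string;    // outlet that published the headline
}

const BASELINE_WINDOW_MS = 7 * 24 * 60 * 60 * 1000; // 7-day window, per the article
const termFrequency = new Map<string, TermEntry[]>();

// Append one mention of a term to its time series.
function recordTerm(term: string, timestamp: number, source: string): void {
  const entries = termFrequency.get(term) ?? [];
  entries.push({ timestamp, source });
  termFrequency.set(term, entries);
}

// Most recent timestamp in a series (0 for an empty series).
function latest(entries: TermEntry[]): number {
  return entries.reduce((max, e) => Math.max(max, e.timestamp), 0);
}

// Drop mentions older than the window, then enforce the term cap.
function pruneOldState(now: number, maxTrackedTerms: number): void {
  const cutoff = now - BASELINE_WINDOW_MS;
  termFrequency.forEach((entries, term) => {
    const fresh = entries.filter((e) => e.timestamp >= cutoff);
    if (fresh.length === 0) termFrequency.delete(term);
    else termFrequency.set(term, fresh);
  });
  if (termFrequency.size <= maxTrackedTerms) return;
  // Assumed eviction policy: keep the terms seen most recently.
  const newestFirst = Array.from(termFrequency.entries()).sort(
    (a, b) => latest(b[1]) - latest(a[1])
  );
  newestFirst.slice(maxTrackedTerms).forEach(([term]) => termFrequency.delete(term));
}
```

Storing the source alongside each timestamp means the source-diversity check can later be computed over the same window as the count.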
Baseline Calculation Strategy
Every hour (BASELINE_REFRESH_MS), the system recomputes historical expectations:
// Inside maybeRefreshBaselines(now)
record.baseline7d = weekCount / 7;
This yields a 7-day moving-average baseline for each term, representing its typical daily volume. The baseline serves as the statistical expectation against which recent surges are measured.
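To make the refresh step concrete, here is a small sketch of an hourly-gated baseline recomputation. The BaselineRecord shape and the function signature are assumptions for illustration; only the weekCount / 7 formula and the hourly interval come from the article.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const BASELINE_REFRESH_MS = 60 * 60 * 1000; // hourly, per the article

// Hypothetical record shape for this example.
interface BaselineRecord {
  timestamps: number[]; // mention times in ms since epoch
  baseline7d: number;   // average mentions per day over the last week
}

let lastBaselineRefresh = Number.NEGATIVE_INFINITY;

// Recompute every baseline, but at most once per refresh interval.
// Returns true when a refresh actually ran.
function maybeRefreshBaselines(
  records: Map<string, BaselineRecord>,
  now: number
): boolean {
  if (now - lastBaselineRefresh < BASELINE_REFRESH_MS) return false;
  const weekStart = now - 7 * DAY_MS;
  records.forEach((record) => {
    const weekCount = record.timestamps.filter(
      (t) => t >= weekStart && t <= now
    ).length;
    record.baseline7d = weekCount / 7;
  });
  lastBaselineRefresh = now;
  return true;
}
```

For example, a term mentioned 21 times in the past week gets a baseline of 3 mentions per day, so with the default 3× multiplier it would need at least 9 mentions inside the recent window to clear the multiplier test.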
Spike Detection Logic
The checkForSpikes(now, config, blockedTerms) function executes the core comparison:
- Recent count aggregation: Counts occurrences within the 2-hour rolling window (ROLLING_WINDOW_MS).
- Spike threshold test: A term triggers when either:
  - recentCount ≥ baseline * spikeMultiplier (default multiplier: 3×), or
  - recentCount ≥ minSpikeCount (for terms lacking sufficient historical data).
- Noise-reduction guards:
  - Source diversity: Requires uniqueSources ≥ MIN_SPIKE_SOURCE_COUNT (minimum 2 distinct news sources).
  - Cooldown period: SPIKE_COOLDOWN_MS prevents repeated alerts for the same term.
  - Token length filter: Ignores tokens shorter than MIN_TOKEN_LENGTH unless flagged as entities.
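The threshold test and guards listed above might be combined roughly as in the sketch below. Several details are assumptions: the isSpike signature, the Mention shape, the values chosen for SPIKE_COOLDOWN_MS and MIN_TOKEN_LENGTH (the article does not state them), and the reading that minSpikeCount applies only when a term has no baseline yet.

```typescript
interface SpikeConfig {
  spikeMultiplier: number;
  minSpikeCount: number;
}

// Hypothetical mention shape for this example.
interface Mention {
  timestamp: number;
  source: string;
}

const ROLLING_WINDOW_MS = 2 * 60 * 60 * 1000; // 2 hours, per the article
const MIN_SPIKE_SOURCE_COUNT = 2;             // per the article
const SPIKE_COOLDOWN_MS = 30 * 60 * 1000;     // assumed value
const MIN_TOKEN_LENGTH = 3;                   // assumed value

function isSpike(
  term: string,
  isEntity: boolean,
  mentions: Mention[],
  baseline7d: number,
  lastSpikeAt: number | undefined,
  now: number,
  config: SpikeConfig
): boolean {
  // Token length guard: short tokens pass only when flagged as entities.
  if (!isEntity && term.length < MIN_TOKEN_LENGTH) return false;
  // Cooldown guard: suppress repeat alerts for the same term.
  if (lastSpikeAt !== undefined && now - lastSpikeAt < SPIKE_COOLDOWN_MS) return false;

  const recent = mentions.filter(
    (m) => m.timestamp >= now - ROLLING_WINDOW_MS && m.timestamp <= now
  );
  // Source diversity guard: a single outlet cannot trigger a spike.
  const uniqueSources = new Set(recent.map((m) => m.source)).size;
  if (uniqueSources < MIN_SPIKE_SOURCE_COUNT) return false;

  // Threshold test (assumed interpretation): the multiplier applies when a
  // baseline exists, the absolute floor applies when it does not.
  const recentCount = recent.length;
  return baseline7d > 0
    ? recentCount >= baseline7d * config.spikeMultiplier
    : recentCount >= config.minSpikeCount;
}
```

Running the cheap guards before the window scan keeps the per-term cost low for terms that could never alert.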
When validated, handleSpike() generates a keyword_spike CorrelationSignal containing the term, confidence score, and contextual metadata.
Multi-Layered Noise Filtering Mechanisms
Statistical noise reduction operates through several defensive layers that prevent common verbs, stop-words, and transient bursts from triggering false alerts.
Static and Dynamic Suppression Lists
- Hard-coded filters: The SUPPRESSED_TRENDING_TERMS array in src/utils/analysis-constants.ts excludes generic nouns, common verbs, and filler words.
- User-defined blocks: The suppressTrendingTerm() API persists blocked terms via localStorage, allowing operators to manually quiet noisy keywords at runtime.
- Entity-only bypass: Terms flagged with isEntity: true bypass the minimum token-length filter, ensuring that short named entities (e.g., "ISIS", "WHO") are not discarded.
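A rough sketch of how a static list and a persisted user list could be checked together is shown below. The storage key, the sample terms, and the helper names are hypothetical, and an in-memory store stands in for localStorage so the example runs outside a browser.

```typescript
// Minimal subset of the Web Storage interface used by this sketch.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// In-memory stand-in for localStorage.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}

const STORAGE_KEY = "blocked-trending-terms"; // hypothetical key
// Illustrative sample only; not the repository's actual list.
const STATIC_SUPPRESSED = new Set<string>(["says", "new", "report"]);

const normalize = (term: string): string => term.trim().toLowerCase();

// Read the user-defined block list from storage.
function loadBlocked(store: KeyValueStore): Set<string> {
  const raw = store.getItem(STORAGE_KEY);
  return new Set<string>(raw ? (JSON.parse(raw) as string[]) : []);
}

// Add a term to the persisted block list.
function suppressTerm(store: KeyValueStore, term: string): void {
  const blocked = loadBlocked(store);
  blocked.add(normalize(term));
  store.setItem(STORAGE_KEY, JSON.stringify(Array.from(blocked)));
}

// A term is suppressed if either list contains it.
function isSuppressed(store: KeyValueStore, term: string): boolean {
  const key = normalize(term);
  return STATIC_SUPPRESSED.has(key) || loadBlocked(store).has(key);
}
```

In a browser, passing window.localStorage in place of MemoryStore would satisfy the same interface.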
ML-Based Entity Enrichment
When enabled, the optional ML NER worker in src/services/ml-worker.ts enriches the term list with high-confidence entities that might otherwise be missed by regex extraction alone. This layer catches emerging proper nouns before they accumulate enough frequency to bypass standard tokenization filters.
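One way such enrichment could work is sketched below: entities above a score threshold are merged into the regex-derived candidates and flagged as entities. The Candidate and NerEntity shapes, the mergeEntities helper, and the 0.8 threshold are assumptions for illustration, not the worker's actual interface.

```typescript
// Hypothetical shapes for this example.
interface Candidate {
  term: string;
  isEntity: boolean;
}

interface NerEntity {
  text: string;
  score: number; // model confidence in [0, 1]
}

// Merge high-confidence NER entities into the candidate list.
// Existing candidates that match an entity are upgraded to isEntity: true.
function mergeEntities(
  candidates: Candidate[],
  nerEntities: NerEntity[],
  minScore = 0.8 // assumed threshold
): Candidate[] {
  const merged = new Map<string, Candidate>();
  candidates.forEach((c) => merged.set(c.term.toLowerCase(), { ...c }));
  nerEntities
    .filter((e) => e.score >= minScore)
    .forEach((e) => {
      const key = e.text.toLowerCase();
      const existing = merged.get(key);
      if (existing) existing.isEntity = true;
      else merged.set(key, { term: e.text, isEntity: true });
    });
  return Array.from(merged.values());
}
```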
Configuring Detection Sensitivity
The algorithm exposes runtime tunables via updateTrendingConfig(), allowing sensitivity adjustments without redeployment.
Adjust Spike Thresholds
Lower the multiplier to catch subtle trends, or raise it to surface only major breaking news:
import { updateTrendingConfig } from '@/services/trending-keywords';
// More sensitive detection
updateTrendingConfig({ spikeMultiplier: 2, minSpikeCount: 3 });
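For readers curious how a partial update like this can be applied, the sketch below merges a partial object over the active configuration. It is an assumption-based illustration: the TrendingConfig shape is limited to the two fields named in the article, the defaults are taken from the sample output shown later, and the validation rule is invented for the example.

```typescript
interface TrendingConfig {
  spikeMultiplier: number;
  minSpikeCount: number;
}

const DEFAULT_CONFIG: TrendingConfig = { spikeMultiplier: 3, minSpikeCount: 5 };
let activeConfig: TrendingConfig = { ...DEFAULT_CONFIG };

// Merge a partial update over the active config and return a copy.
function updateConfig(partial: Partial<TrendingConfig>): TrendingConfig {
  const next = { ...activeConfig, ...partial };
  // Assumed sanity check: reject values that would make every term spike.
  if (!(next.spikeMultiplier > 0) || !(next.minSpikeCount >= 1)) {
    throw new RangeError("spikeMultiplier must be > 0 and minSpikeCount >= 1");
  }
  activeConfig = next;
  return { ...activeConfig };
}
```

Fields omitted from the partial object keep their current values, so updateConfig({ spikeMultiplier: 2 }) changes only the multiplier.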
Retrieve Generated Signals
Consume detected spikes through the drain API:
import { drainTrendingSignals } from '@/services/trending-keywords';
const signals = drainTrendingSignals(); // Returns CorrelationSignal[]
signals.forEach(sig => {
console.log(`Trending: ${sig.title} (confidence: ${sig.confidence})`);
});
Note: drainTrendingSignals() clears the internal queue, making it safe for polling integrations.
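The drain semantics can be illustrated with a small queue sketch. The Signal shape and function names here are hypothetical; the point is only that each drain returns the pending items and leaves the queue empty, so a poller never sees the same signal twice.

```typescript
// Hypothetical, reduced signal shape for this example.
interface Signal {
  title: string;
  confidence: number;
}

const pendingSignals: Signal[] = [];

// Producer side: queue a detected spike.
function enqueueSignal(signal: Signal): void {
  pendingSignals.push(signal);
}

// Consumer side: return everything queued so far and empty the queue.
function drainSignals(): Signal[] {
  return pendingSignals.splice(0, pendingSignals.length);
}
```

A polling integration could call drainSignals() on a timer and process whatever it returns, treating an empty array as "nothing new".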
Manually Suppress Noisy Terms
import { suppressTrendingTerm } from '@/services/trending-keywords';
// Block a term causing false positives
suppressTrendingTerm('breaking');
Practical Implementation Examples
Ingesting Headlines from RSS Feeds
Process incoming articles to update the time-series state:
import { ingestHeadlines } from '@/services/trending-keywords';
const headlines = [
{
title: 'Assad forces clash with HTS near Damascus',
pubDate: new Date(),
source: 'Reuters',
link: 'https://reuters.com/…',
},
];
ingestHeadlines(headlines);
This call immediately updates termFrequency and triggers the spike detection loop.
Current Configuration Inspection
import { getTrendingConfig } from '@/services/trending-keywords';
console.log('Active config:', getTrendingConfig());
// Output: { spikeMultiplier: 3, minSpikeCount: 5, ... }
Summary
- Baseline-over-spike detection: The algorithm compares 2-hour rolling counts against 7-day moving averages to identify statistically significant surges.
- Multi-source validation: Requires at least 2 distinct news sources to confirm a trend, filtering out single-source anomalies.
- Configurable thresholds: spikeMultiplier and minSpikeCount allow runtime tuning of detection sensitivity via updateTrendingConfig().
- Aggressive noise filtering: Combines static suppressed-term lists, token-length guards, cooldown timers, and optional ML NER to eliminate statistical noise.
- Memory-bounded operation: MAX_TRACKED_TERMS (10,000) and automatic pruning of stale data ensure consistent performance in long-running browser or Node.js environments.
Frequently Asked Questions
How does the baseline algorithm differentiate between a genuine trend and a random spike?
The algorithm requires a 2-hour recent count to exceed the 7-day moving-average baseline by a configurable multiplier (default 3×) or a minimum absolute threshold. Additionally, it enforces source diversity (minimum 2 distinct sources) and a cooldown period to prevent single-source viral stories or burst traffic from generating repeated alerts.
Which file contains the suppressed terms list that filters statistical noise?
The SUPPRESSED_TRENDING_TERMS array is defined in src/utils/analysis-constants.ts. This list contains common verbs, stop-words, and generic nouns that are statistically likely to spike randomly but carry no topical significance.
Can detection sensitivity be adjusted without restarting the application?
Yes. The updateTrendingConfig() function in src/services/trending-keywords.ts accepts partial configuration objects (e.g., { spikeMultiplier: 2 }) and persists changes to localStorage. These adjustments take effect immediately on the next spike detection cycle without requiring a service restart.
How does WorldMonitor prevent short tokens or acronyms from being filtered as noise?
While the algorithm ignores tokens shorter than MIN_TOKEN_LENGTH by default, it preserves terms flagged as entities (isEntity: true). Furthermore, the optional ML NER worker (src/services/ml-worker.ts) enriches candidate lists with high-confidence named entities, ensuring that short but significant terms (e.g., "WHO", "EU") are retained for baseline comparison.