How the Media Cache System Works in Summarize: Configuration and Architecture
The media cache system in steipete/summarize is a file-system-based store implemented in src/media-cache.ts that uses SHA-256 URL hashing, JSON indexing, and configurable TTL/size limits to persist downloaded media files with optional integrity verification.
The media cache system provides persistent storage for downloaded audio and video files in the steipete/summarize repository, eliminating redundant network requests during transcript extraction. Implemented in src/media-cache.ts, this lightweight subsystem offers configurable eviction policies, integrity verification modes, and TTL-based expiration to balance storage efficiency with data reliability.
Media Cache System Architecture
The architecture centers on a directory-based store with a JSON index tracking metadata for each cached entry.
Cache Directory and Path Resolution
By default, the system stores files in $HOME/.summarize/cache/media. The resolveMediaCachePath function in src/media-cache.ts (lines 44‑61) handles path expansion, supporting ~/ notation and custom overrides via configuration.
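The path rules described here (default location, ~/ expansion, custom override) can be sketched as follows. This is a minimal illustration, not the repository's resolveMediaCachePath: the helper name expandCachePath and its parameters are assumptions made for the example.

```typescript
import { homedir } from "node:os";
import { isAbsolute, join, resolve } from "node:path";

// Hypothetical helper; the real resolveMediaCachePath may differ in
// signature and edge-case handling.
export function expandCachePath(
  configured: string | null,
  home: string = homedir(),
  cwd: string = process.cwd(),
): string {
  // No override: fall back to the documented default location.
  if (!configured) return join(home, ".summarize", "cache", "media");
  // "~" or "~/..." expands to the home directory.
  if (configured === "~") return home;
  if (configured.startsWith("~/")) return join(home, configured.slice(2));
  // Absolute paths pass through; relative ones resolve against cwd.
  return isAbsolute(configured) ? configured : resolve(cwd, configured);
}
```

Passing home and cwd as parameters keeps the function pure, which makes the expansion rules easy to test without touching the real environment.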
Index File Structure
The cache maintains an index.json file (defined as INDEX_FILENAME) that tracks every entry with the following fields:
- URL: Original source URL
- Filename: Stored filename
- Size: Byte count (sizeBytes)
- SHA‑256: Content hash (when using hash verification)
- MIME type: Media type identifier
- Timestamps: createdAtMs, lastAccessAtMs, expiresAtMs
The index version is locked to 1. The readIndex and writeIndex functions (lines 93‑112) handle persistence, with writeIndex using atomic file operations (write to temp, then rename) to prevent corruption.
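To make the index layout and the write-to-temp-then-rename pattern concrete, here is a rough sketch. The timestamp and sizeBytes field names come from the list above; the remaining property names, the entries map, and the helper names writeIndexAtomic and readIndexOrEmpty are assumptions for illustration and may not match the actual source.

```typescript
import { readFile, rename, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Assumed shape: only sizeBytes and the three timestamps are named in the text.
interface MediaCacheEntry {
  url: string;
  filename: string;
  sizeBytes: number;
  sha256: string | null;
  mediaType: string | null;
  createdAtMs: number;
  lastAccessAtMs: number;
  expiresAtMs: number;
}

interface MediaCacheIndex {
  version: 1;
  entries: Record<string, MediaCacheEntry>;
}

// Write to a sibling temp file, then rename it over the target so readers
// never observe a half-written index.
export async function writeIndexAtomic(dir: string, index: MediaCacheIndex): Promise<void> {
  const target = join(dir, "index.json");
  const temp = `${target}.${process.pid}.tmp`;
  await writeFile(temp, JSON.stringify(index, null, 2), "utf8");
  await rename(temp, target);
}

// Return the stored index, or an empty one if it is missing or unreadable.
export async function readIndexOrEmpty(dir: string): Promise<MediaCacheIndex> {
  try {
    const parsed = JSON.parse(await readFile(join(dir, "index.json"), "utf8"));
    if (parsed?.version === 1 && typeof parsed.entries === "object" && parsed.entries !== null) {
      return parsed as MediaCacheIndex;
    }
  } catch {
    // Missing or corrupt index: fall through to an empty one.
  }
  return { version: 1, entries: {} };
}
```

The rename step is what makes the write atomic on a single filesystem: the temp file lives in the same directory as the target, so the swap does not cross a mount boundary.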
Entry Key Hashing
To create filesystem-safe keys from arbitrary URLs, the hashKey function (lines 63‑66) applies SHA‑256 hashing. This produces a consistent, collision-resistant identifier regardless of URL special characters or length.
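A URL-to-key function of this kind can be written in a few lines with Node's built-in crypto module. The sketch below assumes the key is the hex digest of the raw URL string; the name hashUrlKey is hypothetical, and the real hashKey may normalize the URL or encode the digest differently.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for hashKey: SHA-256 of the URL, hex encoded.
// The result is always 64 characters from [0-9a-f], so it is safe to use
// as a filename on any common filesystem.
export function hashUrlKey(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}
```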
Storage Operations
Put (Store)
The put method validates the URL, computes the file extension from MIME type or original filename via resolveExtension, moves the temporary file into the cache directory, optionally computes SHA‑256 (when verify mode is hash), updates the index, and triggers enforceMaxBytes to maintain size limits.
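The extension step can be pictured as a MIME lookup with a filename fallback. The sketch below is only an illustration of that idea: the function name pickExtension, the MIME table, and the ".bin" fallback are all assumptions, and the repository's resolveExtension may use a different table or precedence.

```typescript
import { extname } from "node:path";

// Illustrative subset of MIME types; not taken from the repository.
const EXTENSION_BY_MIME: Record<string, string> = {
  "audio/mpeg": ".mp3",
  "audio/mp4": ".m4a",
  "audio/wav": ".wav",
  "video/mp4": ".mp4",
  "video/webm": ".webm",
};

// Prefer the MIME type, then the original filename, then a generic fallback.
export function pickExtension(mediaType: string | null, filename: string | null): string {
  // Drop parameters such as "; charset=..." and normalize case.
  const base = mediaType?.split(";")[0]?.trim().toLowerCase();
  const mapped = base ? EXTENSION_BY_MIME[base] : undefined;
  if (mapped) return mapped;
  const fromName = filename ? extname(filename).toLowerCase() : "";
  return fromName || ".bin"; // ".bin" is an assumed fallback
}
```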
Get (Retrieve)
The get method looks up entries by URL hash, first calling pruneExpired to remove stale entries. It then verifies the file according to the configured mode (size check or hash comparison), updates lastAccessAtMs, and returns the file path.
Expiration and Eviction
TTL-based expiration: Each entry stores an expiresAtMs timestamp calculated from the configured ttlMs. The pruneExpired function (lines 77‑84) removes expired entries during every get and put operation.
Size enforcement: The enforceMaxBytes function (lines 95‑118) maintains the cache within maxBytes by evicting least-recently-used entries (sorted by lastAccessAtMs) until the total size falls below the limit.
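Both policies reduce to simple operations over the index entries. The following sketch works on plain in-memory objects and leaves out file deletion; the function names and the Entry type are hypothetical, and treating an entry as expired once expiresAtMs is less than or equal to the current time is an assumption.

```typescript
interface Entry {
  key: string;
  sizeBytes: number;
  lastAccessAtMs: number;
  expiresAtMs: number;
}

// TTL: keep only entries whose expiry lies strictly in the future.
export function pruneExpiredEntries(entries: Entry[], nowMs: number): Entry[] {
  return entries.filter((e) => e.expiresAtMs > nowMs);
}

// LRU: pick entries to evict, least recently accessed first, until the
// remaining total fits within maxBytes.
export function selectEvictions(entries: Entry[], maxBytes: number): Entry[] {
  let total = entries.reduce((sum, e) => sum + e.sizeBytes, 0);
  const evicted: Entry[] = [];
  const byOldestAccess = [...entries].sort((a, b) => a.lastAccessAtMs - b.lastAccessAtMs);
  for (const entry of byOldestAccess) {
    if (total <= maxBytes) break;
    evicted.push(entry);
    total -= entry.sizeBytes;
  }
  return evicted;
}
```

Running the TTL pass before the size pass means expired entries never count toward the size budget, so fewer live entries need to be evicted.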
Verification Modes
The MediaCacheVerifyMode type supports three integrity strategies:
- none: No validation performed; fastest but riskiest.
- size: Validates that sizeBytes matches the actual file size (default).
- hash: Computes SHA‑256 of the file and compares it against the stored hash; strongest integrity guarantee but CPU-intensive.
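The three modes can be sketched as a single check function. This is an illustrative version rather than the repository's code: the names verifyCachedFile and sha256OfFile are hypothetical, and checking the size before hashing in hash mode is an assumption made here to skip the expensive step when a cheap one already fails.

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";
import { stat } from "node:fs/promises";

type VerifyMode = "none" | "size" | "hash";

// Stream the file through SHA-256 so large media files are not loaded into memory.
async function sha256OfFile(path: string): Promise<string> {
  const hash = createHash("sha256");
  for await (const chunk of createReadStream(path)) hash.update(chunk);
  return hash.digest("hex");
}

export async function verifyCachedFile(
  path: string,
  expected: { sizeBytes: number; sha256: string | null },
  mode: VerifyMode,
): Promise<boolean> {
  if (mode === "none") return true; // no validation at all
  try {
    const info = await stat(path);
    if (info.size !== expected.sizeBytes) return false;
    if (mode === "size") return true;
    // mode === "hash": a missing stored hash cannot be verified.
    return expected.sha256 !== null && (await sha256OfFile(path)) === expected.sha256;
  } catch {
    return false; // missing or unreadable file fails verification
  }
}
```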
Media Cache Configuration Options
Configuration resides in the media section of SummarizeConfig (defined in src/config.ts). The following options control behavior:
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | undefined (inherits global) | Master switch for the media cache. |
| maxMb | number | 2048 (DEFAULT_MEDIA_CACHE_MAX_MB) | Maximum cache size in MiB. When exceeded, least-recently-used entries are evicted. |
| ttlDays | number | 7 (DEFAULT_MEDIA_CACHE_TTL_DAYS) | Entry lifetime in days. Expired entries are pruned automatically. |
| path | string | $HOME/.summarize/cache/media | Filesystem location. Supports ~ expansion and relative paths. |
| verify | "none" \| "size" \| "hash" | "size" (DEFAULT_MEDIA_CACHE_VERIFY) | Integrity verification mode on retrieval. |
Defaults are declared in src/media-cache.ts at lines 33‑36:
```typescript
export const DEFAULT_MEDIA_CACHE_MAX_MB = 2048;
export const DEFAULT_MEDIA_CACHE_TTL_DAYS = 7;
export const DEFAULT_MEDIA_CACHE_VERIFY: MediaCacheVerifyMode = "size";
```
Practical Usage Examples
Initializing the Cache
The following example demonstrates loading configuration and creating a cache instance:
```typescript
import {
  createMediaCache,
  resolveMediaCachePath,
  DEFAULT_MEDIA_CACHE_MAX_MB,
  DEFAULT_MEDIA_CACHE_TTL_DAYS,
  DEFAULT_MEDIA_CACHE_VERIFY,
} from "./src/media-cache.js";

async function initCache() {
  // Simulated user configuration
  const cfg = {
    enabled: true,
    maxMb: 1024, // 1 GiB limit
    ttlDays: 14, // Two week retention
    path: "~/my-media-cache", // Custom directory
    verify: "hash" as const, // Strongest integrity check
  };

  // Expand "~" and apply overrides to get the cache directory
  const cachePath = resolveMediaCachePath({
    env: process.env,
    cachePath: cfg.path ?? null,
  });
  if (!cachePath) {
    throw new Error("Unable to resolve cache directory");
  }

  // Convert user-facing units (MiB, days) into bytes and milliseconds
  const maxBytes = (cfg.maxMb ?? DEFAULT_MEDIA_CACHE_MAX_MB) * 1024 * 1024;
  const ttlMs = (cfg.ttlDays ?? DEFAULT_MEDIA_CACHE_TTL_DAYS) * 24 * 60 * 60 * 1000;

  const mediaCache = await createMediaCache({
    path: cachePath,
    maxBytes,
    ttlMs,
    verify: cfg.verify ?? DEFAULT_MEDIA_CACHE_VERIFY,
  });
  return mediaCache;
}
```
Storing Media Files
Use the put method to persist downloaded content:
```typescript
const url = "https://example.com/podcast/episode.mp3";
const tempPath = "/tmp/downloaded-episode.mp3";

const entry = await mediaCache.put({
  url,
  filePath: tempPath,
  mediaType: "audio/mpeg",
  filename: "episode.mp3",
});

console.log(`Cached at: ${entry.filePath}`);
console.log(`SHA-256: ${entry.hash}`);
```
The put operation automatically handles extension resolution, index updates, and size limit enforcement via enforceMaxBytes.
Retrieving Cached Media
The get method handles validation and expiration checking:
```typescript
const cached = await mediaCache.get({
  url: "https://example.com/podcast/episode.mp3",
});

if (cached) {
  console.log("Cache hit:", cached.filePath);
  // File is ready for transcription or playback
} else {
  console.log("Cache miss - download required");
}
```
During retrieval, the system prunes expired entries, verifies file integrity according to the configured mode, and updates the access timestamp for LRU tracking.
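In practice, get and put are usually combined into a "use the cache if possible, otherwise download and store" helper. The sketch below shows one way to do that under stated assumptions: MediaCacheLike is a minimal interface modeled on the get and put calls shown above, while getOrDownload and the download callback are hypothetical names that do not come from the repository. Passing null for the cache models the case where caching is turned off.

```typescript
interface CachedMedia {
  filePath: string;
}

// Minimal interface modeled on the calls shown above; the real type may differ.
interface MediaCacheLike {
  get(args: { url: string }): Promise<CachedMedia | null>;
  put(args: {
    url: string;
    filePath: string;
    mediaType: string | null;
    filename: string | null;
  }): Promise<CachedMedia | null>;
}

// Hypothetical downloader: fetches the URL into a temporary file.
type Download = (url: string) => Promise<{
  filePath: string;
  mediaType: string | null;
  filename: string | null;
}>;

export async function getOrDownload(
  cache: MediaCacheLike | null,
  url: string,
  download: Download,
): Promise<string> {
  // Cache hit: skip the network entirely.
  const hit = cache ? await cache.get({ url }) : null;
  if (hit) return hit.filePath;

  const fetched = await download(url);
  // Cache disabled: use the temporary file directly.
  if (!cache) return fetched.filePath;

  // Assumes put leaves the temp file in place when it returns null.
  const stored = await cache.put({ url, ...fetched });
  return stored?.filePath ?? fetched.filePath;
}
```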
Summary
- The media cache system in steipete/summarize provides persistent, filesystem-based storage for downloaded media files, implemented in src/media-cache.ts.
- It uses SHA-256 URL hashing via hashKey for stable entry keys, a JSON index.json for metadata tracking, and supports three verification modes: none, size, and hash.
- Configuration options include enabled, maxMb (default 2048), ttlDays (default 7), path, and verify (default size), defined in the media section of the config.
- The system automatically handles TTL-based expiration via pruneExpired and size enforcement via enforceMaxBytes using LRU eviction.
- Atomic index writes and path resolution with tilde expansion ensure reliability across different environments.
Frequently Asked Questions
What is the default location for the media cache?
By default, the media cache stores files in $HOME/.summarize/cache/media. You can customize this location using the path configuration option, which supports ~ expansion for the home directory and relative paths that resolve against the current working directory.
How does the media cache handle corrupted or modified files?
The cache uses configurable verification modes to detect corruption. When verify is set to "size" (the default), the system checks that the file size matches the stored metadata. For stronger protection, set verify to "hash", which computes and compares SHA-256 checksums on every read. If verification fails, the entry is treated as a cache miss.
What happens when the cache exceeds the configured size limit?
When the total cache size exceeds maxMb (default 2048 MiB), the system triggers LRU eviction via the enforceMaxBytes function. It sorts entries by lastAccessAtMs (oldest first) and removes files until the total size falls below the limit. This process runs automatically during put operations.
Can I disable the media cache entirely?
Yes. Set enabled: false in the media configuration section. When disabled, the application bypasses the cache and downloads media files directly to temporary locations for each transcription job. This is useful for environments with strict storage constraints or when processing sensitive content that should not persist on disk.