How the Media Cache System Works in Summarize: Configuration and Architecture

The media cache system in steipete/summarize is a file-system-based store implemented in src/media-cache.ts that uses SHA-256 URL hashing, JSON indexing, and configurable TTL/size limits to persist downloaded media files with optional integrity verification.

The media cache system provides persistent storage for downloaded audio and video files in the steipete/summarize repository, eliminating redundant network requests during transcript extraction. Implemented in src/media-cache.ts, this lightweight subsystem offers configurable eviction policies, integrity verification modes, and TTL-based expiration to balance storage efficiency with data reliability.

Media Cache System Architecture

The architecture centers on a directory-based store with a JSON index tracking metadata for each cached entry.

Cache Directory and Path Resolution

By default, the system stores files in $HOME/.summarize/cache/media. The resolveMediaCachePath function in src/media-cache.ts (lines 44‑61) handles path expansion, supporting ~/ notation and custom overrides via configuration.
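
The path rules described above can be illustrated with a minimal sketch. The helper name expandCachePath and its home parameter are hypothetical and exist only for this example; the real resolveMediaCachePath takes an options object with env and cachePath and may handle more cases.

```typescript
import { homedir } from "node:os";
import { join, resolve } from "node:path";

// Hypothetical helper, not the repository's resolveMediaCachePath.
// Falls back to <home>/.summarize/cache/media when no override is given.
function expandCachePath(cachePath: string | null, home: string = homedir()): string {
  if (cachePath === null || cachePath.trim() === "") {
    return join(home, ".summarize", "cache", "media");
  }
  const trimmed = cachePath.trim();
  if (trimmed === "~") return home;
  // "~/foo" becomes "<home>/foo".
  if (trimmed.startsWith("~/")) return join(home, trimmed.slice(2));
  // Anything else is resolved against the current working directory.
  return resolve(trimmed);
}

console.log(expandCachePath("~/my-media-cache"));
```

Passing the home directory as a parameter keeps the sketch easy to test without touching environment variables.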

Index File Structure

The cache maintains an index.json file (defined as INDEX_FILENAME) that tracks every entry with the following fields:

  • URL: Original source URL
  • Filename: Stored filename
  • Size: Byte count (sizeBytes)
  • SHA‑256: Content hash (when using hash verification)
  • MIME type: Media type identifier
  • Timestamps: createdAtMs, lastAccessAtMs, expiresAtMs

The index version is locked to 1. The readIndex and writeIndex functions (lines 93‑112) handle persistence, with writeIndex using atomic file operations (write to temp, then rename) to prevent corruption.
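
As a rough illustration of the write-to-temp-then-rename pattern, the sketch below persists a version 1 index and reads it back. The interface shapes, the property names other than sizeBytes, createdAtMs, lastAccessAtMs, and expiresAtMs, and the function names writeIndexAtomic and readIndexOrEmpty are assumptions for this example, not the repository's actual definitions.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed entry shape based on the fields listed above.
interface IndexEntry {
  url: string;
  fileName: string;
  sizeBytes: number;
  sha256: string | null;
  mediaType: string | null;
  createdAtMs: number;
  lastAccessAtMs: number;
  expiresAtMs: number | null;
}

interface CacheIndex {
  version: 1;
  entries: Record<string, IndexEntry>;
}

// Write the full index to a temp file, then rename it over index.json.
// Readers see either the old file or the new one, never a partial write.
function writeIndexAtomic(dir: string, index: CacheIndex): void {
  const finalPath = join(dir, "index.json");
  const tempPath = `${finalPath}.${process.pid}.tmp`;
  writeFileSync(tempPath, JSON.stringify(index, null, 2), "utf8");
  renameSync(tempPath, finalPath);
}

// Fall back to an empty index when the file is missing or unparsable.
function readIndexOrEmpty(dir: string): CacheIndex {
  try {
    const parsed = JSON.parse(readFileSync(join(dir, "index.json"), "utf8")) as CacheIndex;
    if (parsed.version === 1 && typeof parsed.entries === "object" && parsed.entries !== null) {
      return parsed;
    }
  } catch {
    // Missing or corrupt index: start fresh.
  }
  return { version: 1, entries: {} };
}

// Small demo in a throwaway directory.
const demoIndexDir = mkdtempSync(join(tmpdir(), "media-index-"));
writeIndexAtomic(demoIndexDir, {
  version: 1,
  entries: {
    abc: {
      url: "https://example.com/podcast/episode.mp3",
      fileName: "abc.mp3",
      sizeBytes: 42,
      sha256: null,
      mediaType: "audio/mpeg",
      createdAtMs: 1000,
      lastAccessAtMs: 1000,
      expiresAtMs: null,
    },
  },
});
console.log(Object.keys(readIndexOrEmpty(demoIndexDir).entries));
```

The rename step is what makes the update atomic on a single filesystem; the temp file lives next to index.json so both paths share a volume.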

Entry Key Hashing

To create filesystem-safe keys from arbitrary URLs, the hashKey function (lines 63‑66) applies SHA‑256 hashing. This produces a consistent, collision-resistant identifier regardless of URL special characters or length.
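
A minimal sketch of this kind of key derivation, using Node's built-in crypto module, might look like the following. The function name hashUrlKey is hypothetical, and the hex output encoding is an assumption; the article does not state which encoding hashKey uses.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for hashKey: maps any URL to a 64-character hex string
// that is safe to use as a filename on common filesystems.
function hashUrlKey(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

const exampleKey = hashUrlKey("https://example.com/podcast/episode.mp3?token=a/b&c=d");
console.log(exampleKey.length, /^[0-9a-f]+$/.test(exampleKey));
```

Because the digest has a fixed length, very long URLs and URLs with query strings or slashes all map to keys of the same size.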

Storage Operations

Put (Store)

The put method validates the URL, computes the file extension from MIME type or original filename via resolveExtension, moves the temporary file into the cache directory, optionally computes SHA‑256 (when verify mode is hash), updates the index, and triggers enforceMaxBytes to maintain size limits.
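
To make the extension step concrete, here is a small sketch of how an extension could be chosen from a MIME type with a filename fallback. Everything in it is an assumption for illustration: the function name pickExtension, the MIME table, the precedence (MIME type first, then filename), and the ".bin" fallback are not taken from resolveExtension in the repository.

```typescript
import { extname } from "node:path";

// Illustrative MIME table; the real mapping may be larger or different.
const EXTENSION_BY_MIME: Record<string, string> = {
  "audio/mpeg": ".mp3",
  "audio/mp4": ".m4a",
  "audio/wav": ".wav",
  "video/mp4": ".mp4",
  "video/webm": ".webm",
};

// Hypothetical helper: prefer the MIME type, then the original filename,
// then a generic fallback extension.
function pickExtension(mediaType: string | null, filename: string | null): string {
  // Drop parameters such as "; charset=binary" before the lookup.
  const mime = (mediaType ?? "").split(";")[0]?.trim().toLowerCase() ?? "";
  const fromMime = EXTENSION_BY_MIME[mime];
  if (fromMime) return fromMime;
  const fromName = filename ? extname(filename).toLowerCase() : "";
  return fromName || ".bin";
}

console.log(pickExtension("audio/mpeg", "episode.mp3"));
```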

Get (Retrieve)

The get method first calls pruneExpired to remove stale entries, then looks up the entry by its URL hash. If an entry is found, get verifies the file according to the configured mode (size check or hash comparison), updates lastAccessAtMs, and returns the file path.

Expiration and Eviction

TTL-based expiration: Each entry stores an expiresAtMs timestamp calculated from the configured ttlMs. The pruneExpired function (lines 77‑84) removes expired entries during every get and put operation.
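
The expiration rule can be sketched as a pure function over index entries. The names computeExpiresAtMs and partitionByExpiry are hypothetical, and two details are assumptions: that an entry counts as expired once the current time reaches expiresAtMs, and that a null expiresAtMs means the entry never expires.

```typescript
interface TtlEntry {
  fileName: string;
  createdAtMs: number;
  expiresAtMs: number | null; // null is assumed to mean "never expires"
}

// Expiry is the creation time plus the configured TTL.
function computeExpiresAtMs(createdAtMs: number, ttlMs: number): number {
  return createdAtMs + ttlMs;
}

// Split entries into those still valid and the keys that should be deleted.
function partitionByExpiry(
  entries: Record<string, TtlEntry>,
  nowMs: number,
): { kept: Record<string, TtlEntry>; expiredKeys: string[] } {
  const kept: Record<string, TtlEntry> = {};
  const expiredKeys: string[] = [];
  for (const [key, entry] of Object.entries(entries)) {
    if (entry.expiresAtMs !== null && entry.expiresAtMs <= nowMs) {
      expiredKeys.push(key);
    } else {
      kept[key] = entry;
    }
  }
  return { kept, expiredKeys };
}

const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
const ttlDemo = partitionByExpiry(
  {
    old: { fileName: "old.mp3", createdAtMs: 0, expiresAtMs: computeExpiresAtMs(0, sevenDaysMs) },
    pinned: { fileName: "pinned.mp3", createdAtMs: 0, expiresAtMs: null },
  },
  sevenDaysMs,
);
console.log(ttlDemo.expiredKeys);
```

A caller would then delete the files behind expiredKeys and write the kept entries back to the index.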

Size enforcement: The enforceMaxBytes function (lines 95‑118) maintains the cache within maxBytes by evicting least-recently-used entries (sorted by lastAccessAtMs) until the total size falls below the limit.
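
The eviction order can be sketched as follows. The function name selectEvictions and the SizedEntry shape are hypothetical, and the sketch stops as soon as the total is at or below maxBytes, which is an assumption about the exact boundary condition.

```typescript
interface SizedEntry {
  key: string;
  sizeBytes: number;
  lastAccessAtMs: number;
}

// Returns the keys to evict, least recently used first, until the remaining
// entries fit within maxBytes. The input array is not mutated.
function selectEvictions(entries: SizedEntry[], maxBytes: number): string[] {
  let totalBytes = entries.reduce((sum, entry) => sum + entry.sizeBytes, 0);
  const evictedKeys: string[] = [];
  const byOldestAccess = [...entries].sort((a, b) => a.lastAccessAtMs - b.lastAccessAtMs);
  for (const entry of byOldestAccess) {
    if (totalBytes <= maxBytes) break;
    evictedKeys.push(entry.key);
    totalBytes -= entry.sizeBytes;
  }
  return evictedKeys;
}

const lruDemo = selectEvictions(
  [
    { key: "a", sizeBytes: 100, lastAccessAtMs: 3 },
    { key: "b", sizeBytes: 200, lastAccessAtMs: 1 },
    { key: "c", sizeBytes: 300, lastAccessAtMs: 2 },
  ],
  350,
);
console.log(lruDemo);
```

In this demo the total is 600 bytes against a 350-byte limit, so the two least recently accessed entries (b, then c) are selected and the most recently used entry survives.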

Verification Modes

The MediaCacheVerifyMode type supports three integrity strategies:

  • none: No validation performed; fastest but riskiest.
  • size: Validates that sizeBytes matches the actual file size (default).
  • hash: Computes SHA‑256 of the file and compares against stored hash; strongest integrity guarantee but CPU-intensive.
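
The three modes could be implemented along the lines of the sketch below. The names VerifyMode, ExpectedFile, and verifyCachedFile are hypothetical, and the choice to run the cheap size check before hashing in hash mode is an assumption rather than documented behavior.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readFileSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type VerifyMode = "none" | "size" | "hash";

interface ExpectedFile {
  sizeBytes: number;
  sha256: string | null; // assumed to be null unless the entry was stored in hash mode
}

// Reads the whole file into memory; a streaming hash would suit large media better.
function sha256OfFile(filePath: string): string {
  return createHash("sha256").update(readFileSync(filePath)).digest("hex");
}

function fileSizeOrNull(filePath: string): number | null {
  try {
    return statSync(filePath).size;
  } catch {
    return null; // missing or unreadable file
  }
}

function verifyCachedFile(filePath: string, expected: ExpectedFile, mode: VerifyMode): boolean {
  // "none" skips every check, including whether the file still exists.
  if (mode === "none") return true;
  if (fileSizeOrNull(filePath) !== expected.sizeBytes) return false;
  if (mode === "size") return true;
  return expected.sha256 !== null && sha256OfFile(filePath) === expected.sha256;
}

// Small demo against a throwaway file.
const sampleDir = mkdtempSync(join(tmpdir(), "media-verify-"));
const samplePath = join(sampleDir, "sample.bin");
writeFileSync(samplePath, "hello");
const sampleExpected: ExpectedFile = { sizeBytes: 5, sha256: sha256OfFile(samplePath) };
console.log(verifyCachedFile(samplePath, sampleExpected, "hash"));
```

A false result here corresponds to treating the entry as a cache miss, so the caller would download the media again.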

Media Cache Configuration Options

Configuration resides in the media section of SummarizeConfig (defined in src/config.ts). The following options control behavior:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | undefined (inherits global) | Master switch for the media cache. |
| maxMb | number | 2048 (DEFAULT_MEDIA_CACHE_MAX_MB) | Maximum cache size in MiB. When exceeded, least-recently-used entries are evicted. |
| ttlDays | number | 7 (DEFAULT_MEDIA_CACHE_TTL_DAYS) | Entry lifetime in days. Expired entries are pruned automatically. |
| path | string | $HOME/.summarize/cache/media | Filesystem location. Supports ~ expansion and relative paths. |
| verify | MediaCacheVerifyMode ("none", "size", or "hash") | "size" (DEFAULT_MEDIA_CACHE_VERIFY) | Integrity verification mode on retrieval. |

Defaults are declared in src/media-cache.ts at lines 33‑36:

export const DEFAULT_MEDIA_CACHE_MAX_MB = 2048;
export const DEFAULT_MEDIA_CACHE_TTL_DAYS = 7;
export const DEFAULT_MEDIA_CACHE_VERIFY: MediaCacheVerifyMode = "size";

Practical Usage Examples

Initializing the Cache

The following example demonstrates loading configuration and creating a cache instance:

import { createMediaCache, resolveMediaCachePath, DEFAULT_MEDIA_CACHE_MAX_MB, DEFAULT_MEDIA_CACHE_TTL_DAYS, DEFAULT_MEDIA_CACHE_VERIFY } from "./src/media-cache.js";

async function initCache() {
  // Simulated user configuration
  const cfg = {
    enabled: true,
    maxMb: 1024,               // 1 GiB limit
    ttlDays: 14,               // Two week retention
    path: "~/my-media-cache",  // Custom directory
    verify: "hash" as const,   // Strongest integrity check
  };

  const cachePath = resolveMediaCachePath({
    env: process.env,
    cachePath: cfg.path ?? null,
  });

  if (!cachePath) {
    throw new Error("Unable to resolve cache directory");
  }

  const maxBytes = (cfg.maxMb ?? DEFAULT_MEDIA_CACHE_MAX_MB) * 1024 * 1024;
  const ttlMs = (cfg.ttlDays ?? DEFAULT_MEDIA_CACHE_TTL_DAYS) * 24 * 60 * 60 * 1000;

  const mediaCache = await createMediaCache({
    path: cachePath,
    maxBytes,
    ttlMs,
    verify: cfg.verify ?? DEFAULT_MEDIA_CACHE_VERIFY,
  });

  return mediaCache;
}

Storing Media Files

Use the put method to persist downloaded content:

const url = "https://example.com/podcast/episode.mp3";
const tempPath = "/tmp/downloaded-episode.mp3";

const entry = await mediaCache.put({
  url,
  filePath: tempPath,
  mediaType: "audio/mpeg",
  filename: "episode.mp3",
});

console.log(`Cached at: ${entry.filePath}`);
console.log(`SHA-256: ${entry.hash}`);

The put operation automatically handles extension resolution, index updates, and size limit enforcement via enforceMaxBytes.

Retrieving Cached Media

The get method handles validation and expiration checking:

const cached = await mediaCache.get({
  url: "https://example.com/podcast/episode.mp3",
});

if (cached) {
  console.log("Cache hit:", cached.filePath);
  // File is ready for transcription or playback
} else {
  console.log("Cache miss - download required");
}

During retrieval, the system prunes expired entries, verifies file integrity according to the configured mode, and updates the access timestamp for LRU tracking.

Summary

  • The media cache system in steipete/summarize provides persistent, filesystem-based storage for downloaded media files, implemented in src/media-cache.ts.
  • It uses SHA-256 URL hashing via hashKey for stable entry keys, a JSON index.json for metadata tracking, and supports three verification modes: none, size, and hash.
  • Configuration options include enabled, maxMb (default 2048), ttlDays (default 7), path, and verify (default size), defined in the media section of the config.
  • The system automatically handles TTL-based expiration via pruneExpired and size enforcement via enforceMaxBytes using LRU eviction.
  • Atomic index writes (write to a temp file, then rename) protect index.json from partial writes, and path resolution with tilde expansion keeps the cache location predictable across environments.

Frequently Asked Questions

What is the default location for the media cache?

By default, the media cache stores files in $HOME/.summarize/cache/media. You can customize this location using the path configuration option, which supports ~ expansion for the home directory and relative paths that resolve against the current working directory.

How does the media cache handle corrupted or modified files?

The cache uses configurable verification modes to detect corruption. When verify is set to "size" (the default), the system checks that the file size matches the stored metadata. For stronger protection, set verify to "hash", which computes and compares SHA-256 checksums on every read. If verification fails, the entry is treated as a cache miss.

What happens when the cache exceeds the configured size limit?

When the total cache size exceeds maxMb (default 2048 MiB), the system triggers LRU eviction via the enforceMaxBytes function. It sorts entries by lastAccessAtMs (oldest first) and removes files until the total size falls below the limit. This process runs automatically during put operations.

Can I disable the media cache entirely?

Yes. Set enabled: false in the media configuration section. When disabled, the application bypasses the cache and downloads media files directly to temporary locations for each transcription job. This is useful for environments with strict storage constraints or when processing sensitive content that should not persist on disk.
