# How the Media Cache System Works in Summarize: Configuration and Architecture

> Discover how the media cache system in steipete/summarize efficiently stores downloaded media files using SHA-256 hashing, JSON indexing, and configurable limits. Learn about its architecture and configuration options.

- Repository: [Peter Steinberger/summarize](https://github.com/steipete/summarize)
- Tags: internals
- Published: 2026-02-19

---

**The media cache system in `steipete/summarize` is a file-system-based store implemented in [`src/media-cache.ts`](https://github.com/steipete/summarize/blob/main/src/media-cache.ts) that uses SHA-256 URL hashing, JSON indexing, and configurable TTL/size limits to persist downloaded media files with optional integrity verification.**

The media cache persists downloaded audio and video files so that transcript extraction in `steipete/summarize` does not download the same media twice. Implemented in [`src/media-cache.ts`](https://github.com/steipete/summarize/blob/main/src/media-cache.ts), this small subsystem combines size-bounded LRU eviction, three integrity verification modes, and TTL-based expiration to balance storage use against data reliability.

## Media Cache System Architecture

The architecture centers on a directory-based store with a JSON index tracking metadata for each cached entry.

### Cache Directory and Path Resolution

By default, the system stores files in `$HOME/.summarize/cache/media`. The `resolveMediaCachePath` function in [`src/media-cache.ts`](https://github.com/steipete/summarize/blob/main/src/media-cache.ts) (lines 44‑61) handles path expansion, supporting `~/` notation and custom overrides via configuration.
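The path handling described above can be sketched as follows. This is a minimal illustration, not the repository's `resolveMediaCachePath`: the helper name `expandCachePath` and its `homeDir` parameter are hypothetical, and resolving other paths against the current working directory is an assumption based on the configuration notes later in this article.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper; the real resolveMediaCachePath may differ in signature and behavior.
function expandCachePath(raw: string | null, homeDir: string = os.homedir()): string {
  // No override: fall back to the documented default location.
  if (!raw) return path.join(homeDir, ".summarize", "cache", "media");
  // A bare "~" or a "~/" prefix is replaced with the home directory.
  if (raw === "~") return homeDir;
  if (raw.startsWith("~/")) return path.join(homeDir, raw.slice(2));
  // Anything else is resolved against the current working directory (assumption).
  return path.resolve(raw);
}
```

Passing the home directory as a parameter keeps the function easy to test without touching the real environment.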

### Index File Structure

The cache maintains an `index.json` file (named by the `INDEX_FILENAME` constant) inside the cache directory. It tracks every entry with the following fields:

- **URL**: Original source URL
- **Filename**: Stored filename
- **Size**: Byte count (`sizeBytes`)
- **SHA‑256**: Content hash (when using hash verification)
- **MIME type**: Media type identifier
- **Timestamps**: `createdAtMs`, `lastAccessAtMs`, `expiresAtMs`

The index version is locked to `1`. The `readIndex` and `writeIndex` functions (lines 93‑112) handle persistence, with `writeIndex` using atomic file operations (write to temp, then rename) to prevent corruption.
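To make the index layout and the write-then-rename pattern concrete, here is a minimal sketch. The property names `url`, `fileName`, `sha256`, and `mediaType`, the `entries` map, and the helper names are assumptions for illustration; only `sizeBytes`, the three timestamps, the version `1`, and the `index.json` filename come from the description above. The sketch uses synchronous file calls for brevity, which may not match the real implementation.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

// Assumed entry shape; only sizeBytes and the timestamps are named in the article.
interface IndexEntry {
  url: string;
  fileName: string;
  sizeBytes: number;
  sha256?: string; // present only when hash verification is used
  mediaType?: string;
  createdAtMs: number;
  lastAccessAtMs: number;
  expiresAtMs: number;
}

interface CacheIndex {
  version: 1;
  entries: Record<string, IndexEntry>;
}

const INDEX_FILENAME = "index.json";

// Write to a temp file in the same directory, then rename it over the index.
function writeIndexAtomic(dir: string, index: CacheIndex): void {
  const finalPath = path.join(dir, INDEX_FILENAME);
  const tempPath = `${finalPath}.${process.pid}.tmp`;
  writeFileSync(tempPath, JSON.stringify(index, null, 2), "utf8");
  renameSync(tempPath, finalPath);
}

// Return an empty index when the file is missing, unparsable, or of another version.
function readIndexOrEmpty(dir: string): CacheIndex {
  try {
    const parsed = JSON.parse(readFileSync(path.join(dir, INDEX_FILENAME), "utf8")) as CacheIndex;
    if (parsed.version === 1 && typeof parsed.entries === "object" && parsed.entries !== null) {
      return parsed;
    }
  } catch {
    // Fall through to an empty index.
  }
  return { version: 1, entries: {} };
}

// Round trip through a throwaway directory.
const demoDir = mkdtempSync(path.join(tmpdir(), "media-cache-index-"));
const emptyBefore = readIndexOrEmpty(demoDir);
writeIndexAtomic(demoDir, {
  version: 1,
  entries: {
    abc: {
      url: "https://example.com/a.mp3",
      fileName: "abc.mp3",
      sizeBytes: 1024,
      createdAtMs: 1,
      lastAccessAtMs: 1,
      expiresAtMs: 2,
    },
  },
});
const loaded = readIndexOrEmpty(demoDir);
```

Because the temp file lives in the same directory as the index, the rename stays on one filesystem, so a reader sees either the old index or the new one rather than a partially written file.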

### Entry Key Hashing

To create filesystem-safe keys from arbitrary URLs, the `hashKey` function (lines 63‑66) applies SHA‑256 hashing. This produces a consistent, collision-resistant identifier regardless of URL special characters or length.
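A minimal sketch of this kind of key derivation, using Node's built-in `crypto` module. The function name `urlToKey` and the hex encoding are assumptions; the real `hashKey` may normalize the URL or encode the digest differently.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for hashKey: SHA-256 of the URL, hex encoded (encoding is an assumption).
function urlToKey(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

const exampleKey = urlToKey("https://example.com/podcast/episode.mp3?token=a/b&c=d");
```

The result is always 64 hexadecimal characters, so it is safe to use as a filename however long the URL is or which characters it contains.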

### Storage Operations

#### Put (Store)

The `put` method validates the URL, computes the file extension from MIME type or original filename via `resolveExtension`, moves the temporary file into the cache directory, optionally computes SHA‑256 (when `verify` mode is `hash`), updates the index, and triggers `enforceMaxBytes` to maintain size limits.
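The extension step can be illustrated with a short sketch. It is not the repository's `resolveExtension`: the helper name `pickExtension`, the MIME table, the preference for MIME type over filename, and the `.bin` fallback are all assumptions chosen for illustration.

```typescript
import * as path from "node:path";

// Illustrative MIME-to-extension table; the real mapping is not shown in this article.
const MIME_EXTENSIONS = new Map<string, string>([
  ["audio/mpeg", ".mp3"],
  ["audio/mp4", ".m4a"],
  ["audio/wav", ".wav"],
  ["video/mp4", ".mp4"],
  ["video/webm", ".webm"],
]);

// Hypothetical helper: prefer the MIME type, then the filename, then a generic fallback.
function pickExtension(mediaType: string | null, filename: string | null): string {
  // Drop parameters such as "; charset=..." and normalize case before the lookup.
  const normalized = (mediaType ?? "").split(";", 1)[0]?.trim().toLowerCase() ?? "";
  const mapped = MIME_EXTENSIONS.get(normalized);
  if (mapped) return mapped;
  const fromName = filename ? path.extname(filename).toLowerCase() : "";
  return fromName || ".bin"; // ".bin" fallback is an assumption
}
```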

#### Get (Retrieve)

The `get` method looks up entries by URL hash, first calling `pruneExpired` to remove stale entries. It then verifies the file according to the configured mode (size check or hash comparison), updates `lastAccessAtMs`, and returns the file path.

### Expiration and Eviction

**TTL-based expiration**: Each entry stores an `expiresAtMs` timestamp calculated from the configured `ttlMs`. The `pruneExpired` function (lines 77‑84) removes expired entries during every `get` and `put` operation.

**Size enforcement**: The `enforceMaxBytes` function (lines 95‑118) maintains the cache within `maxBytes` by evicting least-recently-used entries (sorted by `lastAccessAtMs`) until the total size falls below the limit.
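Both policies can be sketched as pure functions over in-memory entries. The `Entry` shape and the function names are hypothetical, file deletion is omitted, and the exact boundary (evicting while the total is strictly greater than `maxBytes`) is an assumption.

```typescript
// Simplified entry shape for illustration; field names follow the index description.
interface Entry {
  key: string;
  sizeBytes: number;
  lastAccessAtMs: number;
  expiresAtMs: number;
}

// Keep only entries whose expiry lies in the future.
function pruneExpiredEntries(entries: Entry[], nowMs: number): Entry[] {
  return entries.filter((entry) => entry.expiresAtMs > nowMs);
}

// Evict least-recently-used entries until the total size no longer exceeds maxBytes.
function evictLeastRecentlyUsed(
  entries: Entry[],
  maxBytes: number,
): { kept: Entry[]; evicted: Entry[] } {
  const kept = [...entries].sort((a, b) => a.lastAccessAtMs - b.lastAccessAtMs);
  let total = kept.reduce((sum, entry) => sum + entry.sizeBytes, 0);
  const evicted: Entry[] = [];
  while (total > maxBytes && kept.length > 0) {
    const victim = kept.shift()!; // oldest access time goes first
    total -= victim.sizeBytes;
    evicted.push(victim);
  }
  return { kept, evicted };
}

const sampleEntries: Entry[] = [
  { key: "a", sizeBytes: 50, lastAccessAtMs: 1, expiresAtMs: 1000 },
  { key: "b", sizeBytes: 50, lastAccessAtMs: 3, expiresAtMs: 1000 },
  { key: "c", sizeBytes: 50, lastAccessAtMs: 2, expiresAtMs: 5 },
];
const live = pruneExpiredEntries(sampleEntries, 10);
const evictionResult = evictLeastRecentlyUsed(live, 60);
```

In the sample data, entry `c` has already expired at time 10, and entry `a` is then evicted because it was accessed least recently and the remaining 100 bytes exceed the 60-byte limit.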

### Verification Modes

The `MediaCacheVerifyMode` type supports three integrity strategies:

- **`none`**: No validation performed; fastest but riskiest.
- **`size`**: Validates that `sizeBytes` matches the actual file size (default).
- **`hash`**: Computes SHA‑256 of the file and compares against stored hash; strongest integrity guarantee but CPU-intensive.
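The three modes can be sketched as a single check. This is an illustrative reading rather than the repository's code: the function name `verifyFile`, the `sha256` property, and the decision to check the size before hashing are assumptions. The sketch also reads the whole file into memory, whereas a production implementation would more likely stream large media files through the hash.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readFileSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

type VerifyMode = "none" | "size" | "hash";

// Hypothetical check; returns false when the file should be treated as a cache miss.
function verifyFile(
  filePath: string,
  expected: { sizeBytes: number; sha256?: string },
  mode: VerifyMode,
): boolean {
  if (mode === "none") return true; // no validation at all
  let actualSize: number;
  try {
    actualSize = statSync(filePath).size;
  } catch {
    return false; // missing or unreadable file
  }
  if (actualSize !== expected.sizeBytes) return false;
  if (mode === "size") return true;
  // mode === "hash": a stored digest is required and must match the file content.
  if (!expected.sha256) return false;
  const digest = createHash("sha256").update(readFileSync(filePath)).digest("hex");
  return digest === expected.sha256;
}

// Small demonstration file in a throwaway directory.
const verifyDir = mkdtempSync(path.join(tmpdir(), "media-cache-verify-"));
const samplePath = path.join(verifyDir, "sample.bin");
writeFileSync(samplePath, "hello");
const sampleHash = createHash("sha256").update("hello").digest("hex");
```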

## Media Cache Configuration Options

Configuration resides in the **`media`** section of `SummarizeConfig` (defined in [`src/config.ts`](https://github.com/steipete/summarize/blob/main/src/config.ts)). The following options control behavior:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `enabled` | `boolean` | `undefined` (inherits global) | Master switch for the media cache. |
| `maxMb` | `number` | `2048` (`DEFAULT_MEDIA_CACHE_MAX_MB`) | Maximum cache size in MiB. When exceeded, the least-recently-used entries are evicted. |
| `ttlDays` | `number` | `7` (`DEFAULT_MEDIA_CACHE_TTL_DAYS`) | Entry lifetime in days. Expired entries are pruned automatically. |
| `path` | `string` | `$HOME/.summarize/cache/media` | Filesystem location. Supports `~` expansion and relative paths. |
| `verify` | `"none" \| "size" \| "hash"` | `"size"` (`DEFAULT_MEDIA_CACHE_VERIFY`) | Integrity verification mode on retrieval. |

Defaults are declared in [`src/media-cache.ts`](https://github.com/steipete/summarize/blob/main/src/media-cache.ts) at lines 33‑36:

```typescript
export const DEFAULT_MEDIA_CACHE_MAX_MB = 2048;
export const DEFAULT_MEDIA_CACHE_TTL_DAYS = 7;
export const DEFAULT_MEDIA_CACHE_VERIFY: MediaCacheVerifyMode = "size";
```

## Practical Usage Examples

### Initializing the Cache

The following example demonstrates loading configuration and creating a cache instance:

```typescript
import {
  createMediaCache,
  resolveMediaCachePath,
  DEFAULT_MEDIA_CACHE_MAX_MB,
  DEFAULT_MEDIA_CACHE_TTL_DAYS,
  DEFAULT_MEDIA_CACHE_VERIFY,
} from "./src/media-cache.js";

async function initCache() {
  // Simulated user configuration
  const cfg = {
    enabled: true,
    maxMb: 1024,               // 1 GiB limit
    ttlDays: 14,               // Two week retention
    path: "~/my-media-cache",  // Custom directory
    verify: "hash" as const,   // Strongest integrity check
  };

  const cachePath = resolveMediaCachePath({
    env: process.env,
    cachePath: cfg.path ?? null,
  });

  if (!cachePath) {
    throw new Error("Unable to resolve cache directory");
  }

  const maxBytes = (cfg.maxMb ?? DEFAULT_MEDIA_CACHE_MAX_MB) * 1024 * 1024;
  const ttlMs = (cfg.ttlDays ?? DEFAULT_MEDIA_CACHE_TTL_DAYS) * 24 * 60 * 60 * 1000;

  const mediaCache = await createMediaCache({
    path: cachePath,
    maxBytes,
    ttlMs,
    verify: cfg.verify ?? DEFAULT_MEDIA_CACHE_VERIFY,
  });

  return mediaCache;
}
```

### Storing Media Files

Use the `put` method to persist downloaded content:

```typescript
// `mediaCache` is the instance returned by `initCache()` in the previous example.
const url = "https://example.com/podcast/episode.mp3";
const tempPath = "/tmp/downloaded-episode.mp3";

const entry = await mediaCache.put({
  url,
  filePath: tempPath,
  mediaType: "audio/mpeg",
  filename: "episode.mp3",
});

console.log(`Cached at: ${entry.filePath}`);
// The content hash is recorded only when `verify` is "hash".
console.log(`SHA-256: ${entry.hash}`);
```

The `put` operation automatically handles extension resolution, index updates, and size limit enforcement via `enforceMaxBytes`.

### Retrieving Cached Media

The `get` method handles validation and expiration checking:

```typescript
const cached = await mediaCache.get({
  url: "https://example.com/podcast/episode.mp3",
});

if (cached) {
  console.log("Cache hit:", cached.filePath);
  // File is ready for transcription or playback
} else {
  console.log("Cache miss - download required");
}
```

During retrieval, the system prunes expired entries, verifies file integrity according to the configured mode, and updates the access timestamp for LRU tracking.

## Summary

- The **media cache system** in `steipete/summarize` provides persistent, filesystem-based storage for downloaded media files, implemented in [`src/media-cache.ts`](https://github.com/steipete/summarize/blob/main/src/media-cache.ts).
- It uses **SHA-256 URL hashing** via `hashKey` for stable entry keys, a JSON `index.json` file in the cache directory for metadata tracking, and supports three verification modes: `none`, `size`, and `hash`.
- **Configuration options** include `enabled`, `maxMb` (default 2048), `ttlDays` (default 7), `path`, and `verify` (default `size`), defined in the `media` section of the config.
- The system automatically handles **TTL-based expiration** via `pruneExpired` and **size enforcement** via `enforceMaxBytes` using LRU eviction.
- Atomic index writes (write to a temp file, then rename) protect the index from corruption, and path resolution with tilde expansion lets the cache location be overridden per environment.

## Frequently Asked Questions

### What is the default location for the media cache?

By default, the media cache stores files in `$HOME/.summarize/cache/media`. You can customize this location using the `path` configuration option, which supports `~` expansion for the home directory and relative paths that resolve against the current working directory.

### How does the media cache handle corrupted or modified files?

The cache uses configurable **verification modes** to detect corruption. When `verify` is set to `"size"` (the default), the system checks that the file size matches the stored metadata. For stronger protection, set `verify` to `"hash"`, which computes and compares SHA-256 checksums on every read. If verification fails, the entry is treated as a cache miss.

### What happens when the cache exceeds the configured size limit?

When the total cache size exceeds `maxMb` (default 2048 MiB), the system triggers **LRU eviction** via the `enforceMaxBytes` function. It sorts entries by `lastAccessAtMs` (oldest first) and removes files until the total size falls below the limit. This process runs automatically during `put` operations.

### Can I disable the media cache entirely?

Yes. Set `enabled: false` in the `media` configuration section. When disabled, the application bypasses the cache and downloads media files directly to temporary locations for each transcription job. This is useful for environments with strict storage constraints or when processing sensitive content that should not persist on disk.