# Memory Footprint Comparison: Lazy vs Eager Dataset Loading in Roboflow Supervision

> Compare memory footprints for lazy vs eager dataset loading in Roboflow Supervision. Discover how lazy loading optimizes RAM usage by storing file paths and loading images on demand.

- Repository: [Roboflow/supervision](https://github.com/roboflow/supervision)
- Tags: performance
- Published: 2026-04-06

---

**Lazy loading keeps image memory roughly constant by storing only file paths and loading images on demand, whereas eager loading holds every image array in RAM, so memory usage scales linearly with dataset size.**

The `DetectionDataset` class in [Roboflow Supervision](https://github.com/roboflow/supervision) provides two distinct initialization strategies that fundamentally alter resource consumption. Knowing how their memory footprints differ is critical when processing large computer vision datasets that may contain hundreds of gigabytes of imagery.

## Understanding the Two Loading Modes

### Lazy Loading: Path-Based Storage

When you instantiate `DetectionDataset` with a `list[str]` of file paths, the library operates in lazy loading mode. According to the implementation in [`src/supervision/dataset/core.py`](https://github.com/roboflow/supervision/blob/main/src/supervision/dataset/core.py) (lines 74-99), the constructor stores only the strings in `self.image_paths`. The actual pixel data remains on disk until explicitly requested through the `_get_image()` method.

**Memory characteristics:**
- Only file paths reside in RAM (typically 50-200 bytes per path)
- Images are loaded via `cv2.imread` inside `_get_image()` when accessed during iteration
- Peak memory is roughly one decoded image plus the path list and annotations, so it is nearly independent of total dataset size
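
The per-path figure above can be checked directly. The following is a minimal sketch that uses only the standard library; the synthetic paths and the helper name `path_list_bytes` are illustrative assumptions, and `sys.getsizeof` reports CPython-specific sizes that vary slightly between versions.

```python
import sys


def path_list_bytes(paths: list[str]) -> int:
    """Approximate RAM held by a list of path strings (CPython-specific)."""
    # Size of the list object itself (header plus one pointer per element)...
    total = sys.getsizeof(paths)
    # ...plus the size of every string object it references.
    total += sum(sys.getsizeof(p) for p in paths)
    return total


# Hypothetical dataset: 50,000 distinct paths of roughly 40 characters each.
paths = [f"/data/project/train/images/img_{i:06d}.jpg" for i in range(50_000)]
total = path_list_bytes(paths)
print(f"{total / 1e6:.1f} MB total, {total / len(paths):.0f} bytes per path")
```

The estimate covers the path list only; the annotations held by the dataset occupy additional memory in both loading modes.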

### Eager Loading: In-Memory Arrays

When you provide a `dict[str, np.ndarray]` mapping paths to pre-loaded arrays, the dataset stores all images in `self._images_in_memory` at construction time. This approach eliminates disk I/O during training but dramatically increases RAM requirements.

**Memory characteristics:**
- All arrays stored in `self._images_in_memory` immediately upon instantiation
- Memory usage equals the sum of all image sizes (height × width × channels × bytes per pixel)
- Footprint grows linearly with dataset size N
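
The linear growth described above follows from simple arithmetic. As a rough sketch (the function name `eager_footprint_bytes` and the example dimensions are illustrative assumptions, not part of Supervision), the footprint of N equally sized images can be estimated like this:

```python
def eager_footprint_bytes(
    num_images: int,
    height: int,
    width: int,
    channels: int = 3,
    bytes_per_value: int = 1,
) -> int:
    """Estimate RAM needed to hold equally sized image arrays in memory.

    The defaults model 8-bit, 3-channel arrays, which is what `cv2.imread`
    returns with its default flags.
    """
    return num_images * height * width * channels * bytes_per_value


# Example: Full HD frames at three dataset sizes.
for n in (100, 10_000, 50_000):
    gb = eager_footprint_bytes(n, height=1080, width=1920) / 1e9
    print(f"{n:>6} images -> {gb:,.1f} GB")
```

The estimate counts pixel data only and ignores the small fixed overhead of each array object, which is negligible next to the pixel buffers.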

## Implementation Details in Supervision

The distinction between loading strategies is enforced through type checking in the constructor and specialized helper methods.

### Constructor Logic

The `__init__` method in [`src/supervision/dataset/core.py`](https://github.com/roboflow/supervision/blob/main/src/supervision/dataset/core.py) initializes the storage containers:

```python
def __init__(self, classes, images, annotations):
    # ... (abridged: the rest of the constructor is omitted here)
    self._images_in_memory: dict[str, np.ndarray] = {}  # empty in lazy mode
```

### Image Retrieval Mechanism

The `_get_image()` method (lines 101-108) handles retrieval differently based on the loading mode:

```python
def _get_image(self, image_path):
    if self._images_in_memory:
        return self._images_in_memory[image_path]   # eager path: dict lookup
    image = cv2.imread(image_path)                  # lazy path: disk read
    return image
```

### Helper Predicates

Supervision provides inspection functions (lines 279-284) to programmatically detect the current mode:

```python
def is_in_memory(dataset):
    return len(dataset._images_in_memory) > 0 or len(dataset.image_paths) == 0


def is_lazy(dataset):
    return len(dataset._images_in_memory) == 0
```
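
To see how the two predicates behave, the sketch below applies them to lightweight stand-ins. The `SimpleNamespace` objects are hypothetical substitutes that carry only the two attributes the predicates read; they are not real `DetectionDataset` instances, and the placeholder string stands in for an image array.

```python
from types import SimpleNamespace


def is_in_memory(dataset) -> bool:
    return len(dataset._images_in_memory) > 0 or len(dataset.image_paths) == 0


def is_lazy(dataset) -> bool:
    return len(dataset._images_in_memory) == 0


# Stand-ins exposing only the attributes the predicates inspect.
lazy = SimpleNamespace(image_paths=["a.jpg"], _images_in_memory={})
eager = SimpleNamespace(image_paths=["a.jpg"], _images_in_memory={"a.jpg": "<array>"})
empty = SimpleNamespace(image_paths=[], _images_in_memory={})

for name, ds in [("lazy", lazy), ("eager", eager), ("empty", empty)]:
    print(f"{name}: is_lazy={is_lazy(ds)}, is_in_memory={is_in_memory(ds)}")
```

An empty dataset satisfies both predicates, so the two checks are not mutually exclusive.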

## Practical Code Examples

### Lazy Loading Example

Pass a list of strings to keep memory usage minimal for large datasets:

```python
import supervision as sv

# Assumes these files exist; cv2.imread returns None for unreadable paths
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
annotations = {p: sv.Detections.empty() for p in image_paths}

ds_lazy = sv.DetectionDataset(
    classes=["cat", "dog"],
    images=image_paths,          # list[str] triggers lazy loading
    annotations=annotations,
)

# Only the current image is held in memory during iteration
for path, img, ann in ds_lazy:
    print(path, img.shape)       # cv2.imread called here, one at a time
```

### Eager Loading Example

Pass a dictionary of arrays when the dataset fits comfortably in RAM:

```python
import cv2
import supervision as sv

# Assumes these files exist; cv2.imread returns None for unreadable paths
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
images = {p: cv2.imread(p) for p in image_paths}   # Load all upfront
annotations = {p: sv.Detections.empty() for p in image_paths}

ds_eager = sv.DetectionDataset(
    classes=["cat", "dog"],
    images=images,              # dict[str, np.ndarray] triggers eager loading
    annotations=annotations,
)

# All three images already resident in RAM; retrieval is O(1) dict lookup
for path, img, ann in ds_eager:
    print(path, img.shape)
```

## Performance and Resource Implications

**Lazy loading** is the better fit when a dataset exceeds available RAM. Because only `self.image_paths` (a list of strings) and the annotations stay resident, you can iterate over terabyte-scale collections on modest hardware. The trade-off is disk I/O and decoding latency during iteration, as `cv2.imread()` executes on every access.

**Eager loading** maximizes iteration speed by storing all data in `self._images_in_memory`, eliminating disk bottlenecks during training loops. This mode suits smaller datasets that fit entirely in memory but becomes prohibitive when working with high-resolution medical imagery or video frames.

## Summary

- **Lazy loading** stores only file paths in `self.image_paths`, so image memory stays roughly constant whether the dataset contains 100 or 100,000 images
- **Eager loading** stores all arrays in `self._images_in_memory`, causing linear memory growth proportional to total pixel data (height × width × channels × bytes per pixel × N, which is height × width × 3 × N bytes for 8-bit, 3-channel images)
- The constructor in [`src/supervision/dataset/core.py`](https://github.com/roboflow/supervision/blob/main/src/supervision/dataset/core.py) automatically selects the mode based on whether you pass a `list[str]` or `dict[str, np.ndarray]`
- Use `is_lazy()` and `is_in_memory()` helpers (lines 279-284) to programmatically detect the current loading strategy before operations like `merge`
- Choose lazy loading for large-scale training pipelines and eager loading for small datasets requiring maximum throughput

## Frequently Asked Questions

### How do I check if my DetectionDataset is using lazy or eager loading?

Supervision provides inspection utilities in [`src/supervision/dataset/core.py`](https://github.com/roboflow/supervision/blob/main/src/supervision/dataset/core.py) (lines 279-284). The `is_lazy()` function returns `True` when `len(dataset._images_in_memory) == 0`, indicating only file paths are stored. Conversely, `is_in_memory()` returns `True` when the dataset contains cached arrays or has no image paths at all. The two checks are not mutually exclusive: an empty dataset satisfies both.

### Can I convert a lazy dataset to eager loading after instantiation?

`DetectionDataset` does not provide a built-in `load_all()` method, but you can populate `dataset._images_in_memory` manually. Read every path in `dataset.image_paths` into a separate dictionary first, then assign that dictionary to the attribute in one step. Filling the internal dictionary entry by entry while iterating the dataset does not work: `_get_image()` takes the in-memory branch as soon as the dictionary is non-empty, so the next uncached path raises `KeyError`. Because `_images_in_memory` is a private attribute, this approach relies on implementation details, and you must ensure sufficient RAM exists for the complete collection.
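
A minimal sketch of such a conversion follows. The helper name `cache_all_images` and its `read_image` parameter are hypothetical, and the demo uses a stand-in object with a fake reader rather than a real `DetectionDataset`; the approach also depends on the private `_images_in_memory` attribute, which may change between releases.

```python
from types import SimpleNamespace
from typing import Any, Callable


def cache_all_images(dataset: Any, read_image: Callable[[str], Any]) -> None:
    """Load every image once and switch the dataset to in-memory lookups."""
    cache = {}
    for path in dataset.image_paths:
        image = read_image(path)
        if image is None:  # cv2.imread signals an unreadable file with None
            raise FileNotFoundError(f"could not read image: {path}")
        cache[path] = image
    # Assign the finished cache in one step. Filling the existing dict entry by
    # entry while iterating the dataset would make `_get_image()` take the
    # in-memory branch after the first insert and raise KeyError for every
    # path that has not been cached yet.
    dataset._images_in_memory = cache


# Demo with a stand-in object and a fake reader (both hypothetical).
demo = SimpleNamespace(image_paths=["a.jpg", "b.jpg"], _images_in_memory={})
cache_all_images(demo, read_image=lambda path: f"pixels of {path}")
print(sorted(demo._images_in_memory))
```

With a real dataset, `cv2.imread` would be the natural choice for `read_image`.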

### What is the memory overhead of lazy loading for a dataset with 50,000 images?

Lazy loading stores only the file path strings rather than the image arrays. For 50,000 images with average path lengths of 100 characters, the characters alone take about 5 MB, and the total stays under 10 MB once CPython's per-string overhead is included, regardless of image resolution. In contrast, eager loading the same dataset as 1920×1080×3 `uint8` arrays (about 6.2 MB each) would require approximately 311 GB of RAM, making lazy loading essential for large-scale computer vision workflows.

### Does lazy loading affect training speed compared to eager loading?

Yes. Lazy loading adds disk I/O and image decoding overhead because `cv2.imread()` executes on every access. Eager loading provides faster access through dictionary lookups in `self._images_in_memory` but requires sufficient RAM to hold the entire dataset. With NVMe SSDs and sequential access patterns, the speed difference may be small, whereas traditional HDDs or random access patterns can introduce significant bottlenecks during epoch iteration.