Spark Checkpoint Versus Persist to Disk: Functional Differences for Fault Tolerance
Checkpointing truncates RDD lineage and writes to reliable distributed storage to survive driver restarts, while persist-to-disk caches blocks locally in the BlockManager for fast reuse without lineage truncation or driver fault tolerance.
When building resilient Spark applications in the apache/spark repository, choosing between spark checkpoint versus persist to disk determines how your application recovers from failures. While both mechanisms store intermediate data, checkpointing provides durable, lineage-truncating storage suitable for long-running jobs, whereas persist-to-disk offers fast, local caching without surviving driver restarts. Understanding these implementation details from the source code helps optimize fault tolerance for streaming and iterative workloads.
How Spark Checkpoint Works
Spark checkpointing materializes the entire RDD to a storage system and truncates its lineage graph. The implementation distinguishes between reliable checkpoints (for driver fault tolerance) and local checkpoints (for lineage truncation only).
Reliable Checkpoint Implementation
The RDD.checkpoint() method in core/src/main/scala/org/apache/spark/rdd/RDD.scala marks the RDD for checkpointing, but the actual materialization occurs later in ReliableRDDCheckpointData.doCheckpoint() from core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala. This implementation writes every partition to a reliable distributed filesystem such as HDFS or S3.
When doCheckpoint() executes, it triggers a separate job that serializes each partition to the checkpoint directory configured via SparkContext.setCheckpointDir(). After successful completion, the RDD’s parent dependencies are replaced with a CheckpointRDD reference, effectively truncating the lineage graph. This truncation prevents the driver from having to recompute the entire transformation history after a failure.
Local Checkpoint Implementation
For scenarios requiring lineage truncation without the overhead of writing to remote storage, RDD.localCheckpoint() in RDD.scala adapts the storage level to use disk-based caching and invokes LocalRDDCheckpointData.doCheckpoint() from core/src/main/scala/org/apache/spark/rdd/LocalRDDCheckpointData.scala.
This implementation writes data to the executor’s local BlockManager rather than to a reliable filesystem. While this truncates lineage and prevents recomputation of long transformation chains, it does not survive driver restarts because the checkpoint metadata exists only in the driver’s memory.
How Persist to Disk Works
Persisting to disk operates through an entirely different mechanism focused on performance optimization rather than fault tolerance across driver restarts.
BlockManager Registration and Storage Levels
The RDD.persist(newLevel, allowOverride) method defined in core/src/main/scala/org/apache/spark/rdd/RDD.scala registers the RDD with the BlockManager and sets the storageLevel property. Unlike checkpointing, persist does not trigger immediate materialization; instead, it marks partitions to be stored according to the specified StorageLevel (defined in core/src/main/scala/org/apache/spark/storage/StorageLevel.scala) when an action is executed.
When using StorageLevel.MEMORY_AND_DISK or DISK_ONLY, the BlockManager writes blocks to the executor’s local disk. However, the RDD maintains its full lineage graph, meaning that if an executor is lost, Spark can recompute the lost partitions from the parent RDDs. Persisted blocks survive executor loss only if the storage level includes replication across multiple executors, but they are irretrievably lost if the driver restarts because the BlockManager state is not persisted to durable storage.
Key Differences Between Spark Checkpoint and Persist to Disk
Understanding the functional distinctions helps architects choose the right mechanism for their resilience requirements.
-
Lineage Truncation: Checkpointing truncates the RDD’s dependency graph by replacing parents with a
CheckpointRDD, preventingStackOverflowErrorin iterative algorithms. Persist preserves full lineage, allowing recomputation from source data if cached blocks are lost. -
Driver Fault Tolerance: Checkpoint files survive driver restarts because they reside on reliable distributed storage (HDFS/S3) and are discoverable via
Checkpoint.getCheckpointFilesfromcore/src/main/scala/org/apache/spark/checkpoint/Checkpoint.scala. Persisted blocks are lost when the driver restarts because theBlockManagermetadata exists only in the driver’s memory. -
Storage Location and Durability: Reliable checkpoints write to a distributed filesystem configured via
setCheckpointDir(), while local checkpoints and persist-to-disk write to executor-local storage. Only reliable checkpoints guarantee durability across total cluster failure. -
Performance Characteristics: Checkpointing triggers a separate synchronous job that incurs network serialization and replication overhead. Persist-to-disk writes locally through the
BlockManagerwithout network transfer, making it significantly faster but less durable. -
Typical Use Cases: Use checkpointing for long-running streaming applications or iterative graph algorithms (e.g., PageRank) where lineage growth causes memory issues or where exact recovery after driver failure is required. Use persist-to-disk for caching lookup tables or intermediate results reused multiple times within a single job where fast access matters more than driver-level fault tolerance.
Practical Code Examples
The following examples demonstrate the API differences based on the source implementations in the apache/spark repository.
Reliable Checkpoint Example
This example uses RDD.checkpoint() to truncate lineage and enable driver recovery via HDFS:
val sc = new SparkContext(conf)
// Configure reliable checkpoint directory on HDFS
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")
val raw = sc.textFile("hdfs://data/logs")
val parsed = raw.map(parseLine)
// Mark for checkpoint - lineage will be truncated after first action
parsed.checkpoint()
// First action triggers ReliableRDDCheckpointData.doCheckpoint()
val count = parsed.count()
// Files now stored in HDFS; parent RDDs can be garbage collected
Source: RDD.checkpoint and ReliableRDDCheckpointData.doCheckpoint
Local Checkpoint Example
For lineage truncation without remote storage overhead, use localCheckpoint():
val sc = new SparkContext(conf)
// Long lineage GraphX computation
val graph = ...
// Truncate lineage using local executor storage
graph.localCheckpoint()
// Triggers LocalRDDCheckpointData.doCheckpoint() to BlockManager
graph.vertices.count()
// Data cached on executor disks; lineage truncated but lost if driver restarts
Source: RDD.localCheckpoint and LocalRDDCheckpointData.doCheckpoint
Persist to Disk Example
For fast reuse without lineage truncation:
val data = sc.textFile("hdfs://data/input")
val cached = data.persist(StorageLevel.MEMORY_AND_DISK)
// Registers with BlockManager via RDD.persist()
// First action materializes to executor disk
cached.count()
// Subsequent actions reuse cached blocks
cached.filter(_.contains("error")).count()
// Lineage preserved; lost blocks recomputed from source if executor fails
Source: RDD.persist
When to Use Spark Checkpoint Versus Persist to Disk
Use checkpointing when:
- Running iterative algorithms (e.g., machine learning, PageRank) where lineage growth causes
StackOverflowErroror excessive driver memory usage - Operating streaming applications via
DStreamcheckpointing that must survive driver restarts with exactly-once semantics - You need guaranteed recovery from total cluster failure using
Checkpoint.getCheckpointFiles
Use persist-to-disk when:
- Reusing an RDD multiple times within a single job (e.g., broadcasting a lookup table or performing multiple aggregations)
- You require low-latency access and can tolerate recomputation if the driver restarts
- Working with large datasets that exceed memory but fit on local executor disks
Avoid using both mechanisms simultaneously on the same RDD, as checkpointing already materializes data to storage; persisting afterward adds unnecessary overhead without providing additional benefits.
Summary
- Spark checkpoint versus persist to disk represents a trade-off between durability and performance: checkpointing provides driver-level fault tolerance by truncating lineage and writing to reliable storage, while persisting offers fast local caching without lineage truncation.
- Checkpointing triggers a separate job via
ReliableRDDCheckpointData.doCheckpoint()to write to HDFS/S3, making it suitable for long-running applications but expensive to execute. - Persisting registers RDDs with the
BlockManagerviaRDD.persist(), keeping data on executor-local disk with full lineage preservation for fast recomputation. - Only checkpointing survives driver restarts; persisted blocks are lost when the driver process fails because
BlockManagermetadata is not persisted to durable storage.
Frequently Asked Questions
Does persist to disk survive executor failure?
Persisted data survives executor failure only if the StorageLevel includes replication across multiple executors (e.g., MEMORY_AND_DISK_2). Without replication, blocks stored on a failed executor are lost, but Spark can recompute them from the preserved lineage graph. Unlike checkpointing, persist-to-disk never survives driver restarts regardless of the replication factor.
Can local checkpoint recover from driver restart?
No. RDD.localCheckpoint() writes data to the executor's BlockManager via LocalRDDCheckpointData.doCheckpoint() from core/src/main/scala/org/apache/spark/rdd/LocalRDDCheckpointData.scala, storing blocks on local disk. While this truncates lineage to prevent StackOverflowError in iterative algorithms, the checkpoint metadata exists only in the driver's memory. When the driver restarts, this metadata is lost, making local checkpoint unsuitable for driver fault tolerance.
Why is checkpointing slower than persisting to disk?
Checkpointing triggers a separate synchronous job that serializes every partition to a reliable distributed filesystem (HDFS, S3) via ReliableRDDCheckpointData.doCheckpoint(), incurring network serialization, replication, and disk I/O overhead. In contrast, RDD.persist(StorageLevel.DISK_ONLY) writes to the local executor disk through the BlockManager without network transfer or job submission, making it significantly faster but only durable for the lifetime of the executor process.
When should I use both checkpoint and persist together?
You should generally avoid using both mechanisms simultaneously on the same RDD. Checkpointing already materializes the complete dataset to storage and truncates lineage, making the persisted blocks redundant. If you checkpoint an RDD, subsequent actions read from the checkpoint files, not from the BlockManager cache. Use checkpointing for fault tolerance and lineage truncation, or use persist for fast temporary caching within a single application run, but not both on the same data.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →