Repartition in Spark vs Coalesce: Performance Differences for Data Redistribution

Repartition always triggers a full shuffle to redistribute data evenly across partitions, while coalesce avoids shuffling when decreasing partition count by merging existing partitions, resulting in significantly lower network overhead but potentially uneven data distribution.

When optimizing Apache Spark jobs in the apache/spark repository, choosing between repartition and coalesce directly impacts network I/O, execution time, and resource utilization. Both transformations modify partition counts for RDDs, DataFrames, and Datasets, yet they operate through fundamentally different physical execution strategies that determine whether data moves across the cluster network.

How Data Redistribution Works Under the Hood

Repartition: The Full Shuffle Approach

In core/src/main/scala/org/apache/spark/rdd/RDD.scala, the repartition(numPartitions: Int) method always triggers a complete shuffle of the dataset across the cluster. The implementation creates a new ShuffleRDD utilizing a HashPartitioner to redistribute every record, ensuring the resulting partitions are roughly equal in size regardless of the source data distribution.

This full shuffle incurs network I/O, disk I/O (spill), and CPU overhead proportional to the total data size. The Dataset API in sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala extends this capability to column-based partitioning via repartition(col: Column, cols: Column*), which internally generates a hash-based shuffle on the specified columns as implemented in ShuffleExchangeExec.scala.

Coalesce: Merging Without Shuffling

The coalesce(numPartitions: Int, shuffle: Boolean = false) method, also defined in RDD.scala, takes a fundamentally different approach when decreasing partition count. By default, it creates a CoalescedRDD that simply collapses existing partitions into fewer ones without moving data across nodes, as handled in CoalescePartitionsExec.scala for the SQL engine.

When shuffle remains false (the default), Spark avoids the expensive shuffle operation entirely, making the transformation much cheaper in terms of network and disk resources. However, if you request more partitions than currently exist, or explicitly set shuffle = true, Spark falls back to the same shuffle-based mechanism used by repartition.

Performance Trade-offs: Shuffle Cost vs Data Balance

Network Cost: Using coalesce to reduce partitions eliminates the need to serialize and transfer every record across the network, dramatically reducing job duration on large datasets. Conversely, repartition moves every record regardless of whether the partition count increases or decreases.

Data Balance: The full shuffle performed by repartition produces uniformly sized partitions, which is essential for downstream operations sensitive to load imbalance such as joins or aggregations. coalesce may leave some partitions significantly larger than others, potentially creating straggler tasks that slow overall execution.

Parallelism Control: Only repartition can increase the number of partitions to raise parallelism for CPU-bound operations. coalesce cannot create new partitions without shuffling; attempting to increase partitions with coalesce triggers a fallback to the shuffle-based repartition logic.

Source Code Implementation Details

RDD API Implementation

According to the apache/spark source code, the core logic resides in core/src/main/scala/org/apache/spark/rdd/RDD.scala:

  • def repartition(numPartitions: Int): Returns a new ShuffleRDD with a HashPartitioner at line 2105
  • def coalesce(numPartitions: Int, shuffle: Boolean = false): Returns a CoalescedRDD when shuffle is false, or delegates to repartition when true, at line 2159

Dataset and DataFrame API

In sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

  • def repartition(col: Column, cols: Column*): Implemented at line 3060, triggers ShuffleExchangeExec for hash-based column partitioning
  • def coalesce(numPartitions: Int): Implemented at line 3082, delegates to the underlying RDD's coalesce method

Practical Code Examples

The following Scala examples demonstrate the behavioral differences between these transformations:

// Example 1: Reducing partitions without a shuffle (cheap)
val df = spark.read.parquet("s3://bucket/large-data")
val reduced = df.coalesce(10)          // merges existing partitions into 10
reduced.write.mode("overwrite").parquet("s3://bucket/reduced")   // fast, but may be skewed

In this snippet, coalesce simply merges adjacent partitions, avoiding network transfer entirely.

// Example 2: Rebalancing data after a skewed operation (expensive but balanced)
val skewed = df.repartition(200)       // forces a full shuffle to 200 partitions
skewed.groupBy("user_id").count().show()

Here, repartition triggers a full shuffle to achieve an even distribution across 200 partitions, preventing task skew during the subsequent aggregation.

// Example 3: Column-based repartition (hash partition on a key)
val keyed = df.repartition(col("country"))   // shuffles based on `country` column
keyed.write.partitionBy("country").mode("overwrite").parquet("s3://bucket/by-country")

This demonstrates how repartition accepts column expressions to create a HashPartitioner based on specific column values, optimizing data locality for downstream operations.

Summary

  • Repartition always performs a full shuffle using a HashPartitioner, creating evenly distributed partitions at the cost of significant network and disk I/O
  • Coalesce avoids shuffling when decreasing partition count by merging existing partitions in CoalescedRDD, offering better performance but potentially uneven data distribution
  • Use coalesce when reducing partitions to save network cost, and repartition when increasing partitions or when downstream operations require balanced data distribution
  • The underlying implementations reside in RDD.scala for core logic and Dataset.scala for DataFrame/Dataset APIs, with physical execution handled by ShuffleExchangeExec and CoalescePartitionsExec

Frequently Asked Questions

When should I use repartition instead of coalesce in Spark?

Use repartition when you need to increase the number of partitions to improve parallelism for CPU-intensive operations, or when downstream stages require evenly balanced partitions to prevent task skew. According to the source code in apache/spark, repartition is also necessary when performing column-based partitioning for joins or aggregations, as it creates the required HashPartitioner through a full shuffle.

Does coalesce always avoid shuffling data in Spark?

No, coalesce only avoids shuffling when the target number of partitions is less than the current number and shuffle remains false (the default). As implemented in RDD.scala, if you request more partitions than exist, or explicitly set shuffle = true, the method falls back to the shuffle-based repartition logic, incurring the same network overhead as a standard repartition operation.

Why does repartition create more balanced partitions than coalesce?

repartition utilizes a HashPartitioner to redistribute every record across the cluster during a full shuffle, ensuring statistical uniformity in partition sizes. In contrast, coalesce merely combines adjacent existing partitions into fewer ones without redistributing individual records, which preserves any pre-existing data skew from upstream operations and may result in significantly uneven partition sizes.

Can coalesce increase the number of partitions in a Spark DataFrame?

No, coalesce cannot increase partition count without triggering a shuffle. The Dataset.coalesce implementation in sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala delegates to the underlying RDD logic, which detects when the target partition count exceeds the current count and automatically falls back to repartition behavior. To explicitly increase partitions, use repartition directly to ensure proper hash-based redistribution.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:
curl -s "https://instagit.com/install.md"

Works with
Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →