# Repartition in Spark vs Coalesce: Performance Differences for Data Redistribution

> Understand repartition vs coalesce performance in Spark. Learn how repartition triggers a full shuffle for even distribution, while coalesce merges partitions to reduce overhead, impacting data balance.

- Repository: [The Apache Software Foundation/spark](https://github.com/apache/spark)
- Tags: performance
- Published: 2026-02-16

---

**Repartition always triggers a full shuffle to redistribute data evenly across partitions, while coalesce avoids shuffling when decreasing partition count by merging existing partitions, resulting in significantly lower network overhead but potentially uneven data distribution.**

When optimizing Apache Spark jobs, choosing between `repartition` and `coalesce` directly affects network I/O, execution time, and resource utilization. Both transformations change the partition count of RDDs, DataFrames, and Datasets, yet the `apache/spark` source shows that they use fundamentally different physical execution strategies, which determine whether data moves across the cluster network.

## How Data Redistribution Works Under the Hood

### Repartition: The Full Shuffle Approach

In [`core/src/main/scala/org/apache/spark/rdd/RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala), the `repartition(numPartitions: Int)` method always triggers a complete shuffle of the dataset across the cluster. It is implemented as `coalesce(numPartitions, shuffle = true)`: each record is tagged with a running integer key that starts at a random offset per source partition, and a `ShuffledRDD` with a `HashPartitioner` then spreads those keys across the target partitions. Because the keys are assigned round-robin rather than derived from record contents, the resulting partitions are roughly equal in size regardless of the source data distribution.

This full shuffle incurs **network I/O, disk I/O (shuffle files and spill), and CPU overhead** proportional to the total data size. The `Dataset` API in [`sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala) extends this capability to column-based partitioning via `repartition(partitionExprs: Column*)`, which hash-partitions rows on the specified columns through a shuffle exchange, as implemented in [`ShuffleExchangeExec.scala`](https://github.com/apache/spark/blob/main/ShuffleExchangeExec.scala). The plain `repartition(numPartitions: Int)` overload on a Dataset uses round-robin partitioning instead.
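
The RDD-level redistribution can be modelled in plain Scala without a cluster. The sketch below is a simplified illustration, not Spark's code: the object and method names are hypothetical, and `scala.util.Random` stands in for Spark's internal random generator. It tags records with running keys from a per-partition random offset and routes them by a non-negative modulo, which is how a hash partitioner treats `Int` keys.

```scala
import scala.util.Random

object ShuffleModeSketch {
  // Non-negative modulo of the key; an Int's hashCode is the Int itself,
  // so this models how a hash partitioner places Int keys.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val m = key % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Tags each record of one source partition with a running Int key that
  // starts at a pseudo-random offset seeded by the partition index.
  def keyRecords[T](index: Int, items: Seq[T], numPartitions: Int): Seq[(Int, T)] = {
    val start = new Random(index).nextInt(numPartitions)
    items.zipWithIndex.map { case (item, i) => (start + i + 1, item) }
  }

  // Returns how many records land in each target partition.
  def redistribute[T](source: Seq[Seq[T]], numPartitions: Int): Vector[Int] = {
    val counts = Array.fill(numPartitions)(0)
    for {
      (items, index) <- source.zipWithIndex
      (key, _) <- keyRecords(index, items, numPartitions)
    } counts(targetPartition(key, numPartitions)) += 1
    counts.toVector
  }

  def main(args: Array[String]): Unit = {
    // Heavily skewed input: 900, 90 and 10 records in three source partitions.
    val skewed = Seq(Seq.fill(900)("a"), Seq.fill(90)("b"), Seq.fill(10)("c"))
    // Each of the 4 target counts lands within a few records of 250.
    println(redistribute(skewed, 4))
  }
}
```

Because every source partition deals its records out in turn, each target partition receives either the floor or the ceiling of that source partition's share, so the totals can differ by at most the number of source partitions.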

### Coalesce: Merging Without Shuffling

The `coalesce(numPartitions: Int, shuffle: Boolean = false)` method, also defined in [`RDD.scala`](https://github.com/apache/spark/blob/main/RDD.scala), takes a fundamentally different approach when decreasing partition count. By default, it creates a `CoalescedRDD` that groups existing partitions into fewer ones through a narrow dependency, so no shuffle stage is introduced. In the SQL engine, `Dataset.coalesce` is planned as a `CoalesceExec` operator, which calls the same RDD method with `shuffle = false`.

When `shuffle` remains `false` (the default), Spark avoids the expensive shuffle operation entirely, making the transformation **much cheaper** in terms of network and disk resources. In this mode `coalesce` can only reduce the partition count: if you request more partitions than currently exist, the result keeps the current number of partitions. Only when you explicitly set `shuffle = true` does Spark use the shuffle-based mechanism, which is exactly the path `repartition` takes.
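
The three cases can be checked on a local session. This is a minimal sketch assuming a Spark dependency on the classpath; the object name, app name, and `local[2]` master are illustrative choices, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession

object CoalescePartitionCounts {
  // Returns partition counts for: coalesce down, coalesce up without a
  // shuffle, and coalesce up with shuffle = true.
  def counts(spark: SparkSession): (Int, Int, Int) = {
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
    val down = rdd.coalesce(2).getNumPartitions
    val upNoShuffle = rdd.coalesce(16).getNumPartitions
    val upShuffle = rdd.coalesce(16, shuffle = true).getNumPartitions
    (down, upNoShuffle, upShuffle)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("coalesce-partition-counts")
      .getOrCreate()
    try println(counts(spark)) // prints (2,8,16)
    finally spark.stop()
  }
}
```

The middle value is the one that tends to surprise: asking for 16 partitions without a shuffle leaves the RDD at its original 8.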

## Performance Trade-offs: Shuffle Cost vs Data Balance

**Network Cost:** Using `coalesce` to reduce partitions eliminates the need to serialize and transfer every record across the network, dramatically reducing job duration on large datasets. Conversely, `repartition` moves every record regardless of whether the partition count increases or decreases.

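**Data Balance:** The full shuffle performed by `repartition` produces near-uniform partitions, which matters for downstream work that runs once per partition, such as expensive per-row transformations or file writes. Joins and aggregations are also sensitive to load imbalance, but they introduce their own key-based shuffle, so their balance depends on the key distribution. `coalesce` may leave some partitions significantly larger than others, potentially creating straggler tasks that slow overall execution. Because it is a narrow dependency, a drastic `coalesce` (for example to a single partition) also reduces the parallelism of the upstream computation in the same stage.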

**Parallelism Control:** Only a shuffle can increase the number of partitions to raise parallelism for CPU-bound operations, so use `repartition` for that. `coalesce` cannot create new partitions without shuffling; calling it with a larger target and the default `shuffle = false` leaves the partition count unchanged, while `coalesce(n, shuffle = true)` is equivalent to `repartition(n)`.
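
The balance trade-off is easy to observe by counting records per partition. The following sketch assumes a local Spark session; the object name, helper names, and the 9,700/100 record split are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PartitionBalance {
  // Counts the records in each partition without collecting the records.
  def sizes[T](rdd: RDD[T]): Array[Int] =
    rdd.mapPartitions(it => Iterator(it.size)).collect()

  // Builds a skewed 4-partition RDD: partition 0 holds 9,700 records,
  // the other three hold 100 each.
  def skewed(spark: SparkSession): RDD[Int] =
    spark.sparkContext
      .parallelize(0 until 4, numSlices = 4)
      .flatMap(p => if (p == 0) Iterator.fill(9700)(p) else Iterator.fill(100)(p))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-balance")
      .getOrCreate()
    try {
      val rdd = skewed(spark)
      // coalesce groups whole partitions, so the large one stays large.
      println("coalesce(2):    " + sizes(rdd.coalesce(2)).mkString(", "))
      // repartition shuffles every record, so the two sizes come out close.
      println("repartition(2): " + sizes(rdd.repartition(2)).mkString(", "))
    } finally spark.stop()
  }
}
```

Comparing the two printed lines against a job's actual task durations is usually a quick way to decide whether the shuffle cost of `repartition` is worth paying.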

## Source Code Implementation Details

### RDD API Implementation

According to the `apache/spark` source code, the core logic resides in [`core/src/main/scala/org/apache/spark/rdd/RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala):

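- **`def repartition(numPartitions: Int)`**: Calls `coalesce(numPartitions, shuffle = true)`, which builds a `ShuffledRDD` with a `HashPartitioner` (line 2105 in the referenced revision)
- **`def coalesce(numPartitions: Int, shuffle: Boolean = false)`**: Returns a `CoalescedRDD` over the existing partitions when `shuffle` is false, or over a freshly shuffled `ShuffledRDD` when true (line 2159 in the referenced revision)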

### Dataset and DataFrame API

In [`sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala):

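- **`def repartition(partitionExprs: Column*)`**: Creates a repartition-by-expression plan that is executed by `ShuffleExchangeExec` with hash partitioning on the given columns (line 3060 in the referenced revision)
- **`def coalesce(numPartitions: Int)`**: Creates a `Repartition` logical node with `shuffle = false`, which is planned as `CoalesceExec` and ultimately calls the underlying RDD's `coalesce` method (line 3082 in the referenced revision)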
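
One way to confirm which physical operator each call produces is to print the executed plan. The sketch below assumes a local Spark session; the object name and the `bucket` column are illustrative, and the exact plan text varies between Spark versions, so treat the printed strings as indicative rather than fixed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PlanInspection {
  // Returns the physical plan text for each of the three calls.
  def plans(spark: SparkSession): Map[String, String] = {
    val df = spark.range(0, 1000).withColumn("bucket", col("id") % 10)
    Map(
      "coalesce" -> df.coalesce(2).queryExecution.executedPlan.toString,
      "repartitionByNum" -> df.repartition(8).queryExecution.executedPlan.toString,
      "repartitionByCol" -> df.repartition(col("bucket")).queryExecution.executedPlan.toString
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("plan-inspection")
      .getOrCreate()
    try {
      // The coalesce plan should show a Coalesce node and no Exchange;
      // both repartition plans should contain an Exchange (shuffle) node.
      plans(spark).foreach { case (name, plan) => println(s"== $name ==\n$plan") }
    } finally spark.stop()
  }
}
```

Calling `df.explain()` gives the same information interactively.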

## Practical Code Examples

The following Scala examples demonstrate the behavioral differences between these transformations:

```scala
// Example 1: Reducing partitions without a shuffle (cheap)
val df = spark.read.parquet("s3://bucket/large-data")
val reduced = df.coalesce(10)          // merges existing partitions into 10
reduced.write.mode("overwrite").parquet("s3://bucket/reduced")   // fast, but may be skewed
```

In this snippet, `coalesce` merges existing partitions through a narrow dependency, so no shuffle stage is added to the job.

```scala
// Example 2: Rebalancing data after a skewed operation (expensive but balanced)
val balanced = df.repartition(200)     // forces a full shuffle to 200 partitions
balanced.groupBy("user_id").count().show()
```

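Here, `repartition` triggers a full shuffle that spreads rows evenly across 200 partitions, so any per-partition work that follows gets similarly sized tasks. Note that `groupBy` adds its own shuffle keyed on `user_id`, so the rebalancing mainly benefits work done before that exchange; skew in the aggregation itself depends on how the `user_id` values are distributed.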

```scala
// Example 3: Column-based repartition (hash partition on a key)
import org.apache.spark.sql.functions.col

val keyed = df.repartition(col("country"))   // shuffles so rows with the same `country` share a partition
keyed.write.partitionBy("country").mode("overwrite").parquet("s3://bucket/by-country")
```

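This demonstrates how `repartition` accepts column expressions and hash-partitions rows on their values (by default into `spark.sql.shuffle.partitions` partitions when no count is given). Because all rows for a given country end up in the same partition, the subsequent `partitionBy` write produces fewer, larger files per country directory.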
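
The co-location property can be verified with `spark_partition_id()`. This is a small sketch on made-up data, assuming a local Spark session; the object name and the sample country codes are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, spark_partition_id}

object KeyColocation {
  // For each country, counts how many distinct partitions hold its rows
  // after repartitioning by that column.
  def partitionsPerKey(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._
    val df = Seq("DE", "FR", "US", "DE", "US", "US", "JP", "FR").toDF("country")
    df.repartition(col("country"))
      .withColumn("pid", spark_partition_id())
      .groupBy("country")
      .agg(countDistinct("pid").as("partitions"))
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("key-colocation")
      .getOrCreate()
    try {
      // Every country should map to exactly one partition.
      partitionsPerKey(spark).foreach { case (country, n) => println(s"$country -> $n") }
    } finally spark.stop()
  }
}
```

Several countries may share a partition, since hash partitioning only guarantees that one key never spans multiple partitions.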

## Summary

- **Repartition** always performs a full shuffle (round-robin keys routed through a `HashPartitioner` on RDDs, round-robin or hash partitioning on Datasets), creating evenly distributed partitions at the cost of significant network and disk I/O
- **Coalesce** avoids shuffling when decreasing partition count by merging existing partitions in `CoalescedRDD`, offering better performance but potentially uneven data distribution
- Use `coalesce` when reducing partitions to save network cost, and `repartition` when increasing partitions or when downstream operations require balanced data distribution
- The underlying implementations reside in [`RDD.scala`](https://github.com/apache/spark/blob/main/RDD.scala) for core logic and [`Dataset.scala`](https://github.com/apache/spark/blob/main/Dataset.scala) for DataFrame/Dataset APIs, with physical execution handled by `ShuffleExchangeExec` and `CoalesceExec`

## Frequently Asked Questions

### When should I use repartition instead of coalesce in Spark?

Use `repartition` when you need to **increase** the number of partitions to improve parallelism for CPU-intensive operations, or when downstream stages require **evenly balanced** partitions to prevent task skew. According to the source code in `apache/spark`, `repartition` is also the right tool for column-based partitioning ahead of joins or aggregations, because co-locating rows by key requires hash partitioning through a full shuffle.

### Does coalesce always avoid shuffling data in Spark?

No. `coalesce` avoids shuffling whenever `shuffle` remains `false` (the default), but in that mode it can only reduce the partition count: if you request more partitions than exist, the count stays unchanged. As implemented in [`RDD.scala`](https://github.com/apache/spark/blob/main/RDD.scala), explicitly setting `shuffle = true` takes the same shuffle path as `repartition` (which is itself `coalesce(numPartitions, shuffle = true)`), incurring the same network overhead as a standard `repartition` operation.

### Why does repartition create more balanced partitions than coalesce?

`repartition` redistributes **every record** across the cluster during a full shuffle. On RDDs it assigns round-robin keys and routes them through a `HashPartitioner`, so each target partition receives nearly the same number of records. In contrast, `coalesce` merely groups existing partitions into fewer ones without redistributing individual records, which preserves any pre-existing data skew from upstream operations and may result in significantly uneven partition sizes.

### Can coalesce increase the number of partitions in a Spark DataFrame?

No. `Dataset.coalesce` never shuffles, so it can only keep or reduce the partition count. The implementation in [`sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala) plans a non-shuffling coalesce on the underlying RDD, and when the target partition count exceeds the current count, the result simply stays at the current number of partitions. To increase partitions, use `repartition`, which performs the shuffle-based redistribution.
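
A quick DataFrame-level check of this behavior, as a sketch assuming a local Spark session (the object name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object DataFrameCoalesceUp {
  // Returns partition counts for: the original DataFrame, coalesce(16),
  // and repartition(16).
  def counts(spark: SparkSession): (Int, Int, Int) = {
    // range(start, end, step, numPartitions) creates 4 input partitions.
    val df = spark.range(0, 1000, 1, 4)
    (
      df.rdd.getNumPartitions,
      df.coalesce(16).rdd.getNumPartitions,
      df.repartition(16).rdd.getNumPartitions
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dataframe-coalesce-up")
      .getOrCreate()
    try println(counts(spark)) // prints (4,4,16)
    finally spark.stop()
  }
}
```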