# How to Explode Multiple Columns in Spark Efficiently and Avoid Cartesian Products

> Learn how to efficiently explode multiple columns in Spark SQL using arrays_zip to combine arrays and avoid Cartesian products. Improve your performance now.

- Repository: [The Apache Software Foundation/spark](https://github.com/apache/spark)
- Tags: how-to-guide
- Published: 2026-02-16

---

**Use `arrays_zip` to combine multiple array columns into a single array of structs, then apply `explode` once to avoid the Cartesian product and performance degradation caused by multiple generators.**

Exploding multiple columns in Spark SQL is a common requirement when normalizing nested data structures. Exploding each column separately, however, pairs every element of one array with every element of the others within each row, which amounts to a per-row Cartesian product. A more efficient approach is to zip the arrays into a single structure and explode it once, relying on the `ArraysZip` expression and the `ExplodeBase` generator implementation in the Apache Spark source code.

## Why the Naive Approach Fails

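When you invoke `explode` on several columns, each call becomes its own **generator** expression. Spark's analyzer generally rejects more than one generator in a single `SELECT` clause (the error states that only one generator is allowed per clause), so in practice the calls are chained across successive `select` or `withColumn` steps or `LATERAL VIEW` clauses. Each chained generator then runs against every row produced by the previous one, so a row with arrays of lengths *m* and *n* yields *m* × *n* output rows instead of max(*m*, *n*). The growth is multiplicative in the array lengths, and the positional alignment between the columns is lost.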
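The row-count difference can be illustrated with a plain-Python model. This is not Spark code: the `explode` helper below is a hypothetical stand-in that emits one output row per array element, and the sample row is made up for illustration.

```python
def explode(rows, column):
    """Model of explode: emit one output row per element of row[column]."""
    out = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)          # copy the other columns unchanged
            new_row[column] = element    # replace the array with one element
            out.append(new_row)
    return out

row = {"id": 1, "letters": ["a", "b", "c"], "numbers": [10, 20, 30]}

# Chained explodes: every letter is paired with every number (3 * 3 rows).
chained = explode(explode([row], "letters"), "numbers")

# Zip first, explode once: elements stay aligned by position (3 rows).
zipped_row = {"id": 1, "pair": list(zip(row["letters"], row["numbers"]))}
aligned = explode([zipped_row], "pair")

print(len(chained), len(aligned))  # 9 3
```

With three elements per array, the chained version produces nine rows, six of which pair a letter with the wrong number.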

The `ExplodeBase` class in [`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala) implements the core logic for the explode-style generators. Each additional generator adds another `Generate` node to the plan, and each node multiplies the number of rows that every downstream operator must process, which drives up both execution time and memory usage.

## The Solution: Zip Then Explode

The efficient pattern uses `arrays_zip` to merge multiple array columns element-wise into a single array of structs, then applies `explode` exactly once. This creates only one generator node, allowing the optimizer to produce a clean `Generate(Explode, ...)` plan without Cartesian products.

### How arrays_zip Works Under the Hood

The `arrays_zip` function is defined in [`sql/api/src/main/scala/org/apache/spark/sql/functions.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/functions.scala) at line 9611. It registers the `ArraysZip` expression, which lives in [`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala) around line 462.

At runtime, `ArraysZip` constructs a `GenericArrayData` of `InternalRow` objects, where each row contains the *i*-th element from every input array. This struct array maintains positional alignment between the original columns without requiring joins or shuffles.
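The element-wise pairing described above can be sketched in plain Python. This is a model of the semantics for equal-length inputs only, not the Spark implementation; `arrays_zip_model` is a hypothetical helper name, and dictionaries stand in for structs.

```python
def arrays_zip_model(**arrays):
    """Model of arrays_zip for equal-length inputs.

    Struct i holds element i of every input array, keyed by column name.
    """
    names = list(arrays)
    return [dict(zip(names, elements)) for elements in zip(*arrays.values())]

zipped = arrays_zip_model(letters=["a", "b"], numbers=[10, 20])
for position, struct in enumerate(zipped):
    print(position, struct)
# 0 {'letters': 'a', 'numbers': 10}
# 1 {'letters': 'b', 'numbers': 20}
```

Because each struct carries one element from every column, a single explode over the zipped array is enough to recover aligned rows.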

### Why a Single Generator Performs Better

By reducing the operation to a single `explode` call, you invoke the `ExplodeBase` implementation only once per input row. The generator can then take part in Whole-Stage Code Generation, which compiles the stage into tight loops that walk the zipped array sequentially. The plan contains a single `Generate` node, so the row multiplication and memory amplification caused by chained generators never occur.

## Practical Code Examples for Exploding Multiple Columns in Spark

### Scala DataFrame API

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_zip, explode, col}

val spark = SparkSession.builder.appName("MultiExplode").getOrCreate()
import spark.implicits._

// Sample data with two array columns
val df = Seq(
  (1, Array("a", "b"), Array(10, 20)),
  (2, Array("c", "d", "e"), Array(30, 40, 50))
).toDF("id", "letters", "numbers")

// Efficient multi-column explode using arrays_zip
val exploded = df
  .withColumn("zipped", arrays_zip(col("letters"), col("numbers")))
  .withColumn("exploded", explode(col("zipped")))
  .select(
    col("id"),
    col("exploded.letters").as("letter"),
    col("exploded.numbers").as("number")
  )

exploded.show()
```

**Output:**

```
+---+------+------+
| id|letter|number|
+---+------+------+
|  1|     a|    10|
|  1|     b|    20|
|  2|     c|    30|
|  2|     d|    40|
|  2|     e|    50|
+---+------+------+
```

### PySpark Implementation

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip, explode, col

spark = SparkSession.builder.appName("MultiExplode").getOrCreate()

df = spark.createDataFrame([
    (1, ["a", "b"], [10, 20]),
    (2, ["c", "d", "e"], [30, 40, 50])
], ["id", "letters", "numbers"])

exploded = (df
    .withColumn("zipped", arrays_zip(col("letters"), col("numbers")))
    .withColumn("exploded", explode(col("zipped")))
    .select(
        "id",
        col("exploded.letters").alias("letter"),
        col("exploded.numbers").alias("number")
    ))

exploded.show()
```

### Pure SQL Syntax

```sql
SELECT
  id,
  exploded.letters AS letter,
  exploded.numbers AS number
FROM (
  SELECT
    id,
    explode(arrays_zip(letters, numbers)) AS exploded
  FROM sample_table
) t
```

### Handling Map Columns

When working with map columns instead of arrays, convert them to array entries before zipping. The `map_entries` function transforms a map into an array of structs containing `key` and `value` fields.

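```scala
import org.apache.spark.sql.functions.{map_entries, arrays_zip, explode, col}
import spark.implicits._ // assumes the `spark` session from the Scala example above

// Hypothetical sample data: one row with two map columns, m1 and m2
val dfWithMaps = Seq(
  (Map("a" -> 1, "b" -> 2), Map("x" -> 10, "y" -> 20))
).toDF("m1", "m2")

val explodedMaps = dfWithMaps
  // Turn each map into an array of (key, value) structs
  .withColumn("e1", map_entries(col("m1")))
  .withColumn("e2", map_entries(col("m2")))
  // Pair the entries by position, then explode once
  .withColumn("zipped", arrays_zip(col("e1"), col("e2")))
  .withColumn("exploded", explode(col("zipped")))
  .select(
    col("exploded.e1.key").as("key1"),
    col("exploded.e1.value").as("value1"),
    col("exploded.e2.key").as("key2"),
    col("exploded.e2.value").as("value2")
  )
```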
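The same map handling can be modeled in plain Python to show what each output row contains. This is a sketch of the semantics rather than Spark code: `map_entries_model` is a hypothetical helper, and the two sample maps are made up for illustration.

```python
def map_entries_model(mapping):
    """Model of map_entries: a list of {'key', 'value'} structs in entry order."""
    return [{"key": k, "value": v} for k, v in mapping.items()]

m1 = {"a": 1, "b": 2}
m2 = {"x": 10, "y": 20}

# Zip the two entry lists by position, then emit one row per pair.
rows = [
    {"key1": e1["key"], "value1": e1["value"],
     "key2": e2["key"], "value2": e2["value"]}
    for e1, e2 in zip(map_entries_model(m1), map_entries_model(m2))
]
print(rows[0])  # {'key1': 'a', 'value1': 1, 'key2': 'x', 'value2': 10}
```

The pairing is purely positional, so it is only meaningful when the entry order of the two maps corresponds.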

### Optimizing with inline for Arrays of Structs

If your data is already an array of structs rather than separate array columns, use `inline` or `inline_outer` to skip the zipping step entirely. These functions expand the struct fields directly into columns without intermediate struct creation.

```scala
// If arr_col is ArrayType(StructType(...))
df.selectExpr("id", "inline(arr_col) as (col1, col2, col3)")
```

The `inline` function is backed by its own generator, defined alongside the explode generators in [`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala). It produces one output row per struct, as `explode` does, but skips the `ArraysZip` transformation, which makes it an efficient pattern for pre-structured array data.
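What `inline` produces can be modeled in a few lines of plain Python. This is a sketch of the semantics, not Spark code: `inline_model` is a hypothetical helper, and dictionaries stand in for structs.

```python
def inline_model(row, column):
    """Model of inline: one output row per struct, one output column per field."""
    rest = {k: v for k, v in row.items() if k != column}
    return [{**rest, **struct} for struct in row[column]]

row = {"id": 1, "arr_col": [{"col1": "a", "col2": 10}, {"col1": "b", "col2": 20}]}
print(inline_model(row, "arr_col"))
# [{'id': 1, 'col1': 'a', 'col2': 10}, {'id': 1, 'col1': 'b', 'col2': 20}]
```

The struct fields become top-level columns directly, so no follow-up `select` on nested fields is needed.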

## Key Source Files in Apache Spark

The multi-column explode pattern relies on these implementation files in the Apache Spark repository:

- **[`sql/api/src/main/scala/org/apache/spark/sql/functions.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/functions.scala)** – Defines the public API for `explode` (line 8775), `arrays_zip` (line 9611), and `inline`, providing the entry points used in DataFrame operations.

- **[`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala)** – Contains `ExplodeBase`, the abstract class that implements the core logic for all explode-style generators, including type checking and runtime row expansion.

- **[`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala)** – Implements the `ArraysZip` expression (around line 462), which constructs the `GenericArrayData` of `InternalRow` objects that power the `arrays_zip` function.

- **[`sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala)** – Registers the SQL function names (`explode`, `arrays_zip`, `inline`) and binds them to their corresponding expression classes at line 366.

- **[`sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala)** – Provides the `select`, `withColumn`, and `selectExpr` methods used to construct the logical plans that invoke these expressions.

## Summary

- **Avoid multiple `explode` calls on columns that should stay aligned**: Spark generally rejects more than one generator in a single `SELECT` clause, and chaining them across projections multiplies the row count by each array's length.
- **Use `arrays_zip`** to merge multiple array columns into a single array of structs, then explode once to preserve positional alignment without joins.
- **Reference the source**: The `ArraysZip` expression in [`collectionOperations.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala) and `ExplodeBase` in [`generators.scala`](https://github.com/apache/spark/blob/main/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala) implement the zip-then-explode path and support Whole-Stage Code Generation.
- **For maps**, convert to entries using `map_entries` before zipping.
- **For existing struct arrays**, use `inline` to skip the zip step entirely and eliminate struct creation overhead.

## Frequently Asked Questions

### What happens if I call explode on multiple columns without zipping?

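Each `explode` call creates its own generator expression in the logical plan. Spark's analyzer generally rejects more than one generator in a single `SELECT` clause, so the calls end up chained across successive projections or `LATERAL VIEW` clauses. Each generator then runs against every row produced by the previous one, pairing every element of one array with every element of the others. The result is a multiplicative increase in output rows, along with significantly higher execution time and memory usage.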

### Can I use explode_outer with arrays_zip?

Yes. You can substitute `explode_outer` for `explode` when working with zipped arrays. This preserves rows where the zipped array is empty or null by returning NULL values for the struct fields, maintaining outer join semantics while still avoiding the Cartesian product penalty of multiple generators.
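The difference between the two generators on empty or null input can be modeled in plain Python. This is a sketch of the semantics, not Spark code: both helper names are hypothetical, and `None` stands in for SQL NULL.

```python
def explode_model(array):
    """Model of explode: no output rows for a null or empty array."""
    return list(array or [])

def explode_outer_model(array):
    """Model of explode_outer: a single null element for a null or empty array."""
    return list(array) if array else [None]

zipped = [{"letters": "a", "numbers": 10}]

print(explode_model([]), explode_outer_model([]))      # [] [None]
print(explode_model(None), explode_outer_model(None))  # [] [None]
print(explode_outer_model(zipped) == explode_model(zipped))  # True
```

For non-empty arrays the two behave identically, so switching to the outer variant only changes whether rows with nothing to explode survive.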

### How do I handle columns of different lengths when exploding multiple columns?

The `arrays_zip` function aligns elements by position and produces as many structs as the longest input array has elements, filling the missing positions of shorter arrays with nulls. If any input array is itself null, the zipped result is null. To keep only complete rows, trim the arrays to a common length with `slice` before zipping, or filter out rows with null struct fields after exploding when nulls carry no meaning in your data.
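Before choosing a strategy, it can help to find out which rows have arrays of different lengths. The following is a plain-Python sketch of that check, with a hypothetical helper name and made-up sample rows; in Spark, the `size` function returns an array's length and can drive the same comparison.

```python
def length_mismatches(rows, columns):
    """Return the ids of rows whose array columns do not all share one length."""
    mismatched = []
    for row in rows:
        lengths = {len(row[c]) for c in columns}
        if len(lengths) > 1:
            mismatched.append(row["id"])
    return mismatched

rows = [
    {"id": 1, "letters": ["a", "b"], "numbers": [10, 20]},
    {"id": 2, "letters": ["c", "d", "e"], "numbers": [30]},
]
print(length_mismatches(rows, ["letters", "numbers"]))  # [2]
```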

### When should I use inline instead of arrays_zip?

Use `inline` or `inline_outer` when your data is already an array of structs rather than separate array columns. These functions expand the struct fields directly into columns without the intermediate `arrays_zip` step, which eliminates struct creation overhead and gives the most direct path for pre-structured array data.