How to Spark Explode Multiple Columns Efficiently and Avoid Cartesian Products
Use arrays_zip to combine multiple array columns into a single array of structs, then apply explode once to avoid the Cartesian product and performance degradation caused by multiple generators.
Exploding multiple columns in Spark SQL is a common requirement when normalizing nested data structures, but calling explode on each column separately creates separate generator nodes that force a Cartesian product of the results. According to the Apache Spark source code, the most efficient approach is to zip the arrays into a single structure and explode it once, leveraging the ArraysZip expression and ExplodeBase generator implementation.
Why the Naive Approach Fails
When you invoke explode on several columns within the same SELECT clause, Spark instantiates multiple generator expressions. The Catalyst optimizer must then construct a Cartesian product of these generators to combine their outputs, resulting in an exponential explosion of rows and severe performance degradation.
The ExplodeBase class in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala implements the core logic for all explode-style generators. When multiple generators exist in the same projection, the planner cannot fuse them into a single Whole-Stage Code Generation unit, forcing interpretive overhead and memory amplification.
The Solution: Zip Then Explode
The efficient pattern uses arrays_zip to merge multiple array columns element-wise into a single array of structs, then applies explode exactly once. This creates only one generator node, allowing the optimizer to produce a clean Generate(Explode, ...) plan without Cartesian products.
How arrays_zip Works Under the Hood
The arrays_zip function is defined in sql/api/src/main/scala/org/apache/spark/sql/functions.scala at line 9611. It registers the ArraysZip expression, which lives in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala around line 462.
At runtime, ArraysZip constructs a GenericArrayData of InternalRow objects, where each row contains the i-th element from every input array. This struct array maintains positional alignment between the original columns without requiring joins or shuffles.
Why a Single Generator Performs Better
By reducing the operation to a single explode call, you invoke the ExplodeBase implementation only once per input row. The Catalyst optimizer can fuse this generator into a single Whole-Stage Code Generation unit, producing tight Java bytecode loops that process the zipped array sequentially. This eliminates the interpretive overhead and memory amplification caused by multiple generator nodes.
Practical Code Examples for Spark Explode Multiple Columns
Scala DataFrame API
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_zip, explode, col}
val spark = SparkSession.builder.appName("MultiExplode").getOrCreate()
import spark.implicits._
// Sample data with two array columns
val df = Seq(
(1, Array("a", "b"), Array(10, 20)),
(2, Array("c", "d", "e"), Array(30, 40, 50))
).toDF("id", "letters", "numbers")
// Efficient multi-column explode using arrays_zip
val exploded = df
.withColumn("zipped", arrays_zip(col("letters"), col("numbers")))
.withColumn("exploded", explode(col("zipped")))
.select(
col("id"),
col("exploded.letters").as("letter"),
col("exploded.numbers").as("number")
)
exploded.show()
Output:
+---+------+------+
| id|letter|number|
+---+------+------+
| 1| a| 10|
| 1| b| 20|
| 2| c| 30|
| 2| d| 40|
| 2| e| 50|
+---+------+------+
PySpark Implementation
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip, explode, col
spark = SparkSession.builder.appName("MultiExplode").getOrCreate()
df = spark.createDataFrame([
(1, ["a", "b"], [10, 20]),
(2, ["c", "d", "e"], [30, 40, 50])
], ["id", "letters", "numbers"])
exploded = (df
.withColumn("zipped", arrays_zip(col("letters"), col("numbers")))
.withColumn("exploded", explode(col("zipped")))
.select(
"id",
col("exploded.letters").alias("letter"),
col("exploded.numbers").alias("number")
))
exploded.show()
Pure SQL Syntax
SELECT
id,
exploded.letters AS letter,
exploded.numbers AS number
FROM (
SELECT
id,
explode(arrays_zip(letters, numbers)) AS exploded
FROM sample_table
) t
Handling Map Columns
When working with map columns instead of arrays, convert them to array entries before zipping. The map_entries function transforms a map into an array of structs containing key and value fields.
import org.apache.spark.sql.functions.{map_entries, arrays_zip, explode, col}
val dfWithMaps = ... // DataFrame with map columns m1 and m2
val explodedMaps = dfWithMaps
.withColumn("e1", map_entries(col("m1")))
.withColumn("e2", map_entries(col("m2")))
.withColumn("zipped", arrays_zip(col("e1"), col("e2")))
.withColumn("exploded", explode(col("zipped")))
.select(
col("exploded.e1.key").as("key1"),
col("exploded.e1.value").as("value1"),
col("exploded.e2.key").as("key2"),
col("exploded.e2.value").as("value2")
)
Optimizing with inline for Arrays of Structs
If your data is already an array of structs rather than separate array columns, use inline or inline_outer to skip the zipping step entirely. This function expands the struct fields directly into columns without intermediate struct creation.
// If arr_col is ArrayType(StructType(...))
df.selectExpr("id", "inline(arr_col) as (col1, col2, col3)")
According to sql/api/src/main/scala/org/apache/spark/sql/functions.scala, inline utilizes the same ExplodeBase hierarchy as explode but eliminates the overhead of the ArraysZip transformation, making it the most efficient pattern for pre-structured array data.
Key Source Files in Apache Spark
The multi-column explode pattern relies on these implementation files in the Apache Spark repository:
-
sql/api/src/main/scala/org/apache/spark/sql/functions.scala– Defines the public API forexplode(line 8775),arrays_zip(line 9611), andinline, providing the entry points used in DataFrame operations. -
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala– ContainsExplodeBase, the abstract class that implements the core logic for all explode-style generators, including type checking and runtime row expansion. -
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala– Implements theArraysZipexpression (around line 462), which constructs theGenericArrayDataofInternalRowobjects that power thearrays_zipfunction. -
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala– Registers the SQL function names (explode,arrays_zip,inline) and binds them to their corresponding expression classes at line 366. -
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala– Provides theselect,withColumn, andselectExprmethods used to construct the logical plans that invoke these expressions.
Summary
- Never use multiple
explodecalls in the same projection, as this creates separate generator nodes that force a Cartesian product and exponential row expansion. - Use
arrays_zipto merge multiple array columns into a single array of structs, then explode once to preserve positional alignment without joins. - Reference the source: The
ArraysZipexpression incollectionOperations.scalaandExplodeBaseingenerators.scalaenable Whole-Stage Code Generation for optimal performance. - For maps, convert to entries using
map_entriesbefore zipping. - For existing struct arrays, use
inlineto skip the zip step entirely and eliminate struct creation overhead.
Frequently Asked Questions
What happens if I call explode on multiple columns without zipping?
Calling explode on multiple columns in the same SELECT clause creates multiple generator expressions in the logical plan. The Catalyst optimizer must compute the Cartesian product of these generators, resulting in a multiplicative explosion of output rows and significantly increased execution time and memory usage.
Can I use explode_outer with arrays_zip?
Yes. You can substitute explode_outer for explode when working with zipped arrays. This preserves rows where the zipped array is empty or null by returning NULL values for the struct fields, maintaining outer join semantics while still avoiding the Cartesian product penalty of multiple generators.
How do I handle columns of different lengths when exploding multiple columns?
The arrays_zip function aligns elements by position and stops at the length of the shortest array. To retain all elements from longer arrays, pad the shorter arrays with null values using concat and array functions before zipping, or handle the remaining elements in a separate operation.
When should I use inline instead of arrays_zip?
Use inline or inline_outer when your data is already an array of structs rather than separate array columns. This function expands the struct fields directly into columns without requiring the intermediate arrays_zip step, eliminating struct creation overhead and providing the most efficient path for pre-structured array data.
Have a question about this repo?
These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:
curl -s "https://instagit.com/install.md" Maintain an open-source project? Get it listed too →