How to Use Spark unionByName to Align DataFrames by Column Name
Use Dataset.unionByName (or the DataFrame alias) to merge DataFrames by matching column names rather than positions, and set allowMissingColumns=true to automatically null-fill missing fields.
When working with evolving data pipelines on Apache Spark, you often need to combine DataFrames whose schemas share column names but differ in order or completeness. unionByName handles this by aligning columns by name rather than by positional index, which prevents the silent misalignment of values that a standard positional union can produce.
How unionByName Works Under the Hood
The unionByName method is implemented in org.apache.spark.sql.classic.Dataset within sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala (lines 1171–1180). When invoked, it constructs a logical Union node with byName = true and passes it to the query planner.
The public API surface is declared in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala (lines 1890–1905), exposing two signatures:
- def unionByName(other: Dataset[T]): Dataset[T], which requires identical column sets.
- def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T], where passing true makes the resolver add missing columns as null-filled fields at the end of the resulting schema.
During analysis, Spark resolves the union by matching fields by name. If allowMissingColumns is enabled, the analyzer adds a null-filled column for every field present in one DataFrame but not the other; otherwise, a mismatch in column sets raises an AnalysisException.
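The resolution rule described above can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not Spark's analyzer code: the union_by_name function, the (columns, rows) pair used to represent a frame, and the ValueError standing in for Spark's AnalysisException are all assumptions made for illustration. It also ignores case sensitivity, duplicate column names, and type checking.

```python
def union_by_name(left, right, allow_missing_columns=False):
    """Toy model of a name-based union; a frame is a (columns, rows) pair."""
    left_cols, left_rows = left
    right_cols, right_rows = right

    # Columns that appear on only one side.
    only_left = [c for c in left_cols if c not in right_cols]
    only_right = [c for c in right_cols if c not in left_cols]

    if (only_left or only_right) and not allow_missing_columns:
        # Stand-in for the AnalysisException Spark raises in this case.
        raise ValueError(f"column sets differ: {only_left + only_right}")

    # The left frame's order wins; right-only columns go at the end.
    out_cols = list(left_cols) + only_right

    def project(cols, row):
        by_name = dict(zip(cols, row))
        # dict.get returns None (our stand-in for null) for absent columns.
        return tuple(by_name.get(c) for c in out_cols)

    out_rows = [project(left_cols, r) for r in left_rows]
    out_rows += [project(right_cols, r) for r in right_rows]
    return out_cols, out_rows


df1 = (["col0", "col1", "col2"], [(1, 2, 3)])
df2 = (["col1", "col0", "col3"], [(4, 5, 6)])
cols, rows = union_by_name(df1, df2, allow_missing_columns=True)
print(cols)  # ['col0', 'col1', 'col2', 'col3']
print(rows)  # [(1, 2, 3, None), (5, 4, None, 6)]
```

Calling the same function without allow_missing_columns=True on these two frames raises the error, which mirrors the strict behaviour of the single-argument signature.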
Using unionByName in Practice
Basic Union with Different Column Orders
When DataFrames contain the same columns in different orders, unionByName aligns them correctly without manual reordering:
import spark.implicits._ // brings toDF into scope; spark-shell does this automatically
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
// Align columns by name → result keeps df1 column order
df1.unionByName(df2).show()
Output:
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
Handling Missing Columns with allowMissingColumns
For schema evolution scenarios where columns exist in one DataFrame but not the other, enable allowMissingColumns:
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
df1.unionByName(df2, allowMissingColumns = true).show()
Output:
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|   1|   2|   3|NULL|
|   5|   4|NULL|   6|
+----+----+----+----+
Missing columns are appended to the end of the schema and populated with null for rows from the DataFrame lacking that field.
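Because the null-filling happens silently, a misspelled or renamed column does not fail the union; it simply produces two half-empty columns. One way to catch that early is to compare the column lists before the union. The helper below is a hypothetical sketch (missing_column_report is not a Spark API); it works on plain lists of names, so in PySpark you would pass df1.columns and df2.columns.

```python
def missing_column_report(left_cols, right_cols):
    """Return, for each side, the columns whose values would be null-filled."""
    left_set, right_set = set(left_cols), set(right_cols)
    return {
        # Present only on the right, so rows from the left get null.
        "null_in_left_rows": [c for c in right_cols if c not in left_set],
        # Present only on the left, so rows from the right get null.
        "null_in_right_rows": [c for c in left_cols if c not in right_set],
    }


report = missing_column_report(
    ["col0", "col1", "col2"],
    ["col1", "col0", "col3"],
)
print(report)
# {'null_in_left_rows': ['col3'], 'null_in_right_rows': ['col2']}
```

Logging or asserting on this report before calling unionByName with allowMissingColumns makes unexpected schema drift visible instead of letting it turn into nulls.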
PySpark Usage
The same functionality is available in PySpark through the DataFrame API:
df1 = spark.createDataFrame([(1, 2, 3)], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([(4, 5, 6)], ["col1", "col0", "col3"])
# unionByName with missing-column support
result = df1.unionByName(df2, allowMissingColumns=True)
result.show()
Chaining Multiple Unions
You can chain unionByName calls to combine multiple DataFrames with heterogeneous schemas:
val dfA = Seq((1, "a")).toDF("id", "val")
val dfB = Seq((2, "b")).toDF("id", "val")
val dfC = Seq((3, "c")).toDF("id", "extra")
val finalDf = dfA.unionByName(dfB).unionByName(dfC, allowMissingColumns = true)
finalDf.show()
Output:
+---+----+-----+
| id| val|extra|
+---+----+-----+
|  1|   a| NULL|
|  2|   b| NULL|
|  3|NULL|    c|
+---+----+-----+
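When the number of DataFrames is not known in advance, the same chaining can be expressed as a fold. The PySpark-oriented sketch below assumes each element exposes DataFrame.unionByName with the allowMissingColumns keyword; the helper name union_all_by_name is hypothetical. Unlike the chained example, it passes the same flag to every step, which is more permissive than mixing strict and lenient calls.

```python
from functools import reduce


def union_all_by_name(frames, allow_missing_columns=True):
    """Fold a sequence of DataFrames into one via repeated unionByName calls."""
    frames = list(frames)
    if not frames:
        raise ValueError("expected at least one DataFrame")
    # reduce applies unionByName left to right, so the first frame's
    # column order is kept and later-only columns are appended.
    return reduce(
        lambda left, right: left.unionByName(
            right, allowMissingColumns=allow_missing_columns
        ),
        frames,
    )


# Usage, assuming df_a, df_b and df_c are existing PySpark DataFrames:
# final_df = union_all_by_name([df_a, df_b, df_c])
```

A single-element list is returned unchanged, so callers do not need a special case for one input.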
Key Implementation Details and Limitations
According to the API documentation in the apache/spark source, unionByName supports nested columns in struct and array types, but nested columns inside map types are not currently supported. When allowMissingColumns is enabled, the analyzer places missing columns at the end of the resulting schema rather than preserving their original ordinal positions.
The core logic resides in sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala, where the method constructs a Union logical node with byName = true. The public API contract is defined in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala.
Summary
- Use unionByName instead of the standard union method when DataFrames share column names but differ in column order.
- Set allowMissingColumns=true to safely merge DataFrames with different schemas, automatically filling missing values with null.
- The implementation creates a logical Union node with byName=true in org.apache.spark.sql.classic.Dataset.
- The operation supports nested columns in struct and array types, but not nested columns inside map types.
- Missing columns are appended to the end of the schema when using the allowMissingColumns option.
Frequently Asked Questions
What is the difference between union and unionByName in Spark?
Standard union aligns columns by their positional index, so values silently end up under the wrong column names if the column orders differ between DataFrames. unionByName matches columns by their names, ensuring that values align correctly regardless of their ordinal position in the source DataFrames.
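The difference is easy to see with the two single-row frames from the basic example. The snippet below is a plain-Python illustration of the two alignment rules, not Spark code; the variable names are made up for the sketch.

```python
cols1, row1 = ["col0", "col1", "col2"], (1, 2, 3)
cols2, row2 = ["col1", "col2", "col0"], (4, 5, 6)

# Positional union: the second row is appended as-is, so the value 4
# (which belongs to col1) ends up under col0.
positional = [row1, row2]

# Name-based union: look each value up by its column name, then emit
# it in the first frame's column order.
lookup = dict(zip(cols2, row2))
by_name = [row1, tuple(lookup[c] for c in cols1)]

print(positional)  # [(1, 2, 3), (4, 5, 6)]
print(by_name)     # [(1, 2, 3), (6, 4, 5)]
```

Both results have the same shape and types, which is why the positional mistake raises no error and is hard to spot after the fact.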
How do I handle columns that exist in one DataFrame but not the other?
Pass allowMissingColumns=true as the second argument to unionByName. This instructs the analyzer to add missing columns to the result schema and populate them with null values for rows originating from the DataFrame that lacks those columns.
Does unionByName work with nested structures like structs?
Yes. unionByName supports nested columns in struct and array types. With allowMissingColumns enabled, missing nested fields of struct columns that share a name are also filled with null and appended to the end of the struct. Nested columns inside map types are not currently supported.
Where is unionByName implemented in the Spark source code?
The public API is declared in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala (lines 1890–1905), while the concrete implementation resides in sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala (lines 1171–1180). The method constructs a logical Union node with byName=true that the query planner resolves during analysis.