How to Use Spark unionByName to Align DataFrames by Column Name

Use Dataset.unionByName (or the DataFrame alias) to merge DataFrames by matching column names rather than positions, and set allowMissingColumns=true to automatically null-fill missing fields.

When working with evolving data pipelines, you often need to combine DataFrames whose schemas share column names but differ in order or completeness. The unionByName operation in apache/spark handles this by aligning columns by name rather than by positional index, which avoids the silent misalignment of values that a standard positional union produces when column orders differ.

How unionByName Works Under the Hood

The unionByName method is implemented in org.apache.spark.sql.classic.Dataset within sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala (lines 1171–1180). When invoked, it constructs a logical Union node with byName = true, which the analyzer later resolves.

The public API surface is declared in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala (lines 1890–1905), exposing two signatures:

  • def unionByName(other: Dataset[T]): Dataset[T] — requires identical column sets.
  • def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] — when true, the resolver adds missing columns as null-filled fields at the end of the resulting schema.

During analysis, before any data is read, the analyzer resolves the union by matching fields by name. If allowMissingColumns is enabled, it fills in null values for columns present in one DataFrame but not the other; otherwise, a mismatch in column sets raises an AnalysisException.
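The resolution rule described above can be modeled in a few lines of plain Python. This is only an illustrative sketch of the name-matching behavior, not Spark's analyzer code: the function name union_by_name_model, the list-of-tuples row representation, and the use of ValueError in place of AnalysisException are all assumptions made for the example, and names are compared as exact strings.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[object], ...]


def union_by_name_model(
    left_cols: Sequence[str],
    left_rows: Sequence[Row],
    right_cols: Sequence[str],
    right_rows: Sequence[Row],
    allow_missing_columns: bool = False,
) -> Tuple[List[str], List[Row]]:
    """Toy model of a name-based union (hypothetical helper, not a Spark API)."""
    left_only = [c for c in left_cols if c not in right_cols]
    right_only = [c for c in right_cols if c not in left_cols]
    if (left_only or right_only) and not allow_missing_columns:
        # Spark raises AnalysisException here; ValueError stands in for it.
        raise ValueError(f"columns do not match: {left_only + right_only}")

    # The left column order is kept; columns found only on the right go last.
    out_cols = list(left_cols) + right_only

    def realign(cols: Sequence[str], rows: Sequence[Row]) -> List[Row]:
        index = {name: i for i, name in enumerate(cols)}
        # Look each output column up by name; absent columns become None.
        return [
            tuple(row[index[c]] if c in index else None for c in out_cols)
            for row in rows
        ]

    return out_cols, realign(left_cols, left_rows) + realign(right_cols, right_rows)


cols, rows = union_by_name_model(
    ["col0", "col1", "col2"], [(1, 2, 3)],
    ["col1", "col0", "col3"], [(4, 5, 6)],
    allow_missing_columns=True,
)
print(cols)
print(rows)
```

Calling the same function with allow_missing_columns left at False raises the error instead, mirroring the strict single-argument signature.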

Using unionByName in Practice

Basic Union with Different Column Orders

When DataFrames contain the same columns in different orders, unionByName aligns them correctly without manual reordering:

// Assumes an active SparkSession named `spark` (as in spark-shell)
import spark.implicits._

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")

// Align columns by name → result keeps df1 column order
df1.unionByName(df2).show()

Output:


+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
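
To see why the second row comes out as 6, 4, 5, it can help to compare the two alignment strategies on that single row. The snippet below is a plain-Python illustration using dictionaries, not Spark code; the variable names are chosen only for this example.

```python
left_cols = ["col0", "col1", "col2"]
right_cols = ["col1", "col2", "col0"]
right_row = (4, 5, 6)

# Positional union: the right row is read under the left column names as-is,
# so the value that belonged to col1 ends up under col0.
by_position = dict(zip(left_cols, right_row))

# Name-based union: each value is first tied to its own column name,
# then emitted in the left DataFrame's column order.
right_named = dict(zip(right_cols, right_row))
by_name = {c: right_named[c] for c in left_cols}

print(by_position)  # {'col0': 4, 'col1': 5, 'col2': 6}
print(by_name)      # {'col0': 6, 'col1': 4, 'col2': 5}
```

The positional result raises no error, which is what makes this kind of misalignment easy to miss.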

Handling Missing Columns with allowMissingColumns

For schema evolution scenarios where columns exist in one DataFrame but not the other, enable allowMissingColumns:

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")

df1.unionByName(df2, allowMissingColumns = true).show()

Output:


+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|   1|   2|   3|NULL|
|   5|   4|NULL|   6|
+----+----+----+----+

Missing columns are appended to the end of the schema and populated with null for rows from the DataFrame lacking that field.

PySpark Usage

The same functionality is available in PySpark through the DataFrame API:

df1 = spark.createDataFrame([(1, 2, 3)], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([(4, 5, 6)], ["col1", "col0", "col3"])

# unionByName with missing-column support
result = df1.unionByName(df2, allowMissingColumns=True)
result.show()

Chaining Multiple Unions

You can chain unionByName calls to combine multiple DataFrames with heterogeneous schemas:

val dfA = Seq((1, "a")).toDF("id", "val")
val dfB = Seq((2, "b")).toDF("id", "val")
val dfC = Seq((3, "c")).toDF("id", "extra")

val finalDf = dfA.unionByName(dfB).unionByName(dfC, allowMissingColumns = true)
finalDf.show()

Output:


+---+----+-----+
| id| val|extra|
+---+----+-----+
|  1|   a| NULL|
|  2|   b| NULL|
|  3|NULL|    c|
+---+----+-----+
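
When the number of DataFrames is not fixed, the chain can be expressed as a fold. The following is a minimal PySpark-oriented sketch, assuming every element exposes the unionByName(other, allowMissingColumns=...) method shown in the PySpark section; the helper name union_all_by_name is hypothetical and not part of Spark.

```python
from functools import reduce
from typing import Iterable, TypeVar

DF = TypeVar("DF")


def union_all_by_name(frames: Iterable[DF], allow_missing_columns: bool = True) -> DF:
    """Fold frames left to right with unionByName (hypothetical helper)."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame")
    # The first frame seeds the result, so its column order comes first.
    return reduce(
        lambda acc, nxt: acc.unionByName(
            nxt, allowMissingColumns=allow_missing_columns
        ),
        frames,
    )
```

With PySpark equivalents of dfA, dfB, and dfC, union_all_by_name([dfA, dfB, dfC]).show() would be expected to give the same three rows as the chained Scala call. Passing allow_missing_columns=False makes every step strict, so any schema difference fails during analysis.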

Key Implementation Details and Limitations

According to the API documentation in the apache/spark source code, unionByName with allowMissingColumns enabled also handles missing nested columns inside struct and array types; nested columns inside map types are not currently supported. When allowMissingColumns is enabled, the analyzer places missing top-level columns at the end of the resulting schema rather than preserving their original ordinal positions.

The core logic resides in sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala, where the method constructs a Union logical node with byName = true. The public API contract is defined in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala.

Summary

  • Use unionByName instead of the standard union method when DataFrames share column names but differ in column order.
  • Set allowMissingColumns=true to safely merge DataFrames with different schemas, automatically filling missing values with null.
  • The implementation creates a logical Union node with byName=true in org.apache.spark.sql.classic.Dataset.
  • With allowMissingColumns, missing nested columns inside structs and arrays are handled, but nested columns inside map types are not currently supported.
  • Missing columns are appended to the end of the schema when using the allowMissingColumns option.

Frequently Asked Questions

What is the difference between union and unionByName in Spark?

Standard union aligns columns by their positional index, so if the column orders differ between DataFrames, values silently end up under the wrong column names. unionByName matches columns by their names, ensuring that values align correctly regardless of their ordinal position in the source DataFrames.

How do I handle columns that exist in one DataFrame but not the other?

Pass allowMissingColumns=true as the second argument to unionByName. This instructs the analyzer to add missing columns to the result schema and populate them with null values for rows originating from the DataFrame that lacks those columns.

Does unionByName work with nested structures like structs?

Yes. With allowMissingColumns enabled, unionByName can also fill in missing nested columns inside struct and array types. Nested columns inside map types are not currently supported. Without allowMissingColumns, the column name matching applies to the top-level fields.

Where is unionByName implemented in the Spark source code?

The public API is declared in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala (lines 1890–1905), while the concrete implementation resides in sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala (lines 1171–1180). The method constructs a logical Union node with byName=true that the analyzer resolves during analysis.
