# How to Use Spark unionByName to Align DataFrames by Column Name

> Learn how to use Spark's `unionByName` to merge DataFrames by column name rather than position, and how `allowMissingColumns = true` null-fills columns that one side lacks.

- Repository: [The Apache Software Foundation/spark](https://github.com/apache/spark)
- Tags: how-to-guide
- Published: 2026-02-20

---

**Use `Dataset.unionByName` (or the DataFrame alias) to merge DataFrames by matching column names rather than positions, and set `allowMissingColumns=true` to automatically null-fill missing fields.**

Evolving data pipelines often need to combine DataFrames whose schemas share column names but differ in column order or completeness. `unionByName`, implemented in the `apache/spark` repository, handles this by aligning columns on their names rather than their positional index, which avoids the silently misaligned values that a positional `union` can produce.

## How unionByName Works Under the Hood

The `unionByName` method is implemented in `org.apache.spark.sql.classic.Dataset` within [`sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala) (lines 1171–1180). When invoked, it constructs a logical `Union` node with `byName = true` and leaves the column matching to the analyzer.

The public API surface is declared in [`sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala) (lines 1890–1905), exposing two signatures:

- `def unionByName(other: Dataset[T]): Dataset[T]` — requires identical column sets.
- `def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T]` — when `true`, the resolver adds missing columns as `null`-filled fields at the end of the resulting schema.

During analysis, before any data is read, the analyzer resolves the union by matching fields by name. If `allowMissingColumns` is enabled, it projects `null` for columns present in one DataFrame but not the other; otherwise, a column that cannot be matched raises an `AnalysisException`.
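The alignment rule described above can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not Spark's analyzer code: `union_by_name_model` is a hypothetical name, rows are plain tuples, `None` stands in for SQL `NULL`, and `ValueError` stands in for `AnalysisException`. It also ignores case sensitivity, duplicate column names, and type checks.

```python
from typing import Any, List, Sequence, Tuple

Row = Tuple[Any, ...]


def union_by_name_model(
    left_cols: Sequence[str],
    left_rows: Sequence[Row],
    right_cols: Sequence[str],
    right_rows: Sequence[Row],
    allow_missing_columns: bool = False,
) -> Tuple[List[str], List[Row]]:
    """Toy model of by-name alignment (hypothetical helper, not a Spark API)."""
    only_left = [c for c in left_cols if c not in right_cols]
    only_right = [c for c in right_cols if c not in left_cols]
    if (only_left or only_right) and not allow_missing_columns:
        # Spark raises AnalysisException here; ValueError is a stand-in.
        raise ValueError(f"columns without a match: {only_left + only_right}")

    # The left column order wins; columns found only on the right go last.
    out_cols = list(left_cols) + only_right

    def project(cols: Sequence[str], row: Row) -> Row:
        by_name = dict(zip(cols, row))
        # Names absent from this side become None (the stand-in for NULL).
        return tuple(by_name.get(c) for c in out_cols)

    rows = [project(left_cols, r) for r in left_rows]
    rows += [project(right_cols, r) for r in right_rows]
    return out_cols, rows


cols, rows = union_by_name_model(
    ["col0", "col1", "col2"], [(1, 2, 3)],
    ["col1", "col0", "col3"], [(4, 5, 6)],
    allow_missing_columns=True,
)
print(cols)  # ['col0', 'col1', 'col2', 'col3']
print(rows)  # [(1, 2, 3, None), (5, 4, None, 6)]
```

Calling the same function with `allow_missing_columns=False` on these inputs raises, which mirrors the strict single-argument overload.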

## Using unionByName in Practice

### Basic Union with Different Column Orders

When DataFrames contain the same columns in different orders, `unionByName` aligns them correctly without manual reordering:

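```scala
// toDF on a Seq needs the session implicits (already in scope in spark-shell)
import spark.implicits._

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")

// Align columns by name → result keeps df1 column order
df1.unionByName(df2).show()
```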

Output:

```
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
```

### Handling Missing Columns with allowMissingColumns

For schema evolution scenarios where columns exist in one DataFrame but not the other, enable `allowMissingColumns`:

```scala
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")

df1.unionByName(df2, allowMissingColumns = true).show()
```

Output:

```
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|   1|   2|   3|NULL|
|   5|   4|NULL|   6|
+----+----+----+----+
```

Missing columns are appended to the end of the schema and populated with `null` for rows from the DataFrame lacking that field.
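Before enabling the flag, it can help to see exactly which columns each side lacks. The sketch below uses a hypothetical helper, `column_diff`, that works on plain lists of names such as those returned by a PySpark DataFrame's `columns` attribute. It compares names as exact strings, which is an assumption and may not match how your session is configured to resolve column names.

```python
from typing import List, Sequence, Tuple


def column_diff(
    left_cols: Sequence[str], right_cols: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Return (names only on the left, names only on the right), order preserved."""
    left_set, right_set = set(left_cols), set(right_cols)
    only_left = [c for c in left_cols if c not in right_set]
    only_right = [c for c in right_cols if c not in left_set]
    return only_left, only_right


# With real DataFrames you would pass df1.columns and df2.columns instead.
only_left, only_right = column_diff(
    ["col0", "col1", "col2"], ["col1", "col0", "col3"]
)
print(only_left)   # ['col2'] -> null-filled for rows coming from the right side
print(only_right)  # ['col3'] -> appended last, null-filled for left-side rows

# A non-empty difference means the strict overload would fail analysis.
needs_allow_missing = bool(only_left or only_right)
print(needs_allow_missing)  # True
```

Logging this difference before the union makes unexpected schema drift visible instead of letting it turn silently into `null` columns.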

### PySpark Usage

The same functionality is available in PySpark through the DataFrame API:

```python
df1 = spark.createDataFrame([(1, 2, 3)], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([(4, 5, 6)], ["col1", "col0", "col3"])

# unionByName with missing-column support
result = df1.unionByName(df2, allowMissingColumns=True)
result.show()
```

### Chaining Multiple Unions

You can chain `unionByName` calls to combine multiple DataFrames with heterogeneous schemas:

```scala
val dfA = Seq((1, "a")).toDF("id", "val")
val dfB = Seq((2, "b")).toDF("id", "val")
val dfC = Seq((3, "c")).toDF("id", "extra")

val finalDf = dfA.unionByName(dfB).unionByName(dfC, allowMissingColumns = true)
finalDf.show()
```

Output:

```
+---+----+-----+
| id| val|extra|
+---+----+-----+
|  1|   a| NULL|
|  2|   b| NULL|
|  3|NULL|    c|
+---+----+-----+
```
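When the number of inputs is not fixed, the chain can be written as a fold. The following is a minimal sketch with a hypothetical helper, `union_all_by_name`. It relies only on each input exposing PySpark's `unionByName(other, allowMissingColumns=...)` method, so the block itself does not import `pyspark`. Unlike the chained example, it applies the same `allowMissingColumns` setting at every step.

```python
from functools import reduce
from typing import Any, Iterable


def union_all_by_name(frames: Iterable[Any], allow_missing_columns: bool = True) -> Any:
    """Fold a sequence of DataFrames into one with repeated unionByName calls."""
    frames = list(frames)
    if not frames:
        # reduce() without an initial value would raise a less helpful TypeError.
        raise ValueError("need at least one DataFrame")
    return reduce(
        lambda acc, nxt: acc.unionByName(nxt, allowMissingColumns=allow_missing_columns),
        frames,
    )


# Usage with PySpark DataFrames (assumes df_a, df_b, df_c already exist):
#   final_df = union_all_by_name([df_a, df_b, df_c])
#   final_df.show()
```

The first DataFrame in the list determines the leading column order, and columns introduced by later inputs are appended in the order they are first seen.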

## Key Implementation Details and Limitations

According to the API documentation in the `apache/spark` source, `unionByName` handles **nested columns inside `struct` and `array` types**, but **nested columns inside `map` types are not currently supported**. When `allowMissingColumns` is enabled, the analyzer places missing columns at the end of the resulting schema rather than preserving their original ordinal positions.

The core logic resides in [`sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala), where the method constructs a `Union` logical node with `byName = true`. The public API contract is defined in [`sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala).

## Summary

- Use `unionByName` instead of the positional `union` method when DataFrames share column names but differ in column order.
- Set `allowMissingColumns=true` to safely merge DataFrames with different schemas, automatically filling missing values with `null`.
- The implementation creates a logical `Union` node with `byName=true` in `org.apache.spark.sql.classic.Dataset`.
- The operation handles nested columns in `struct` and `array` types, but not nested columns in `map` types.
- Missing columns are appended to the end of the schema when using the `allowMissingColumns` option.

## Frequently Asked Questions

### What is the difference between union and unionByName in Spark?

Standard `union` aligns columns by position, so when column orders differ between DataFrames it silently places values under the wrong column names. `unionByName` matches columns by name, so values align correctly regardless of their ordinal position in the source DataFrames.

### How do I handle columns that exist in one DataFrame but not the other?

Pass `allowMissingColumns = true` as the second argument to `unionByName` (`allowMissingColumns=True` in PySpark). This instructs the analyzer to add missing columns to the result schema and populate them with `null` values for rows originating from the DataFrame that lacks those columns.

### Does unionByName work with nested structures like structs?

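Yes, `unionByName` supports nested columns in `struct` and `array` types. However, nested columns in `map` types are not currently supported. By default, name matching is applied to the top-level columns; with `allowMissingColumns` enabled, nested fields missing from a same-named struct column are also filled with `null` and appended to the end of that struct.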

### Where is unionByName implemented in the Spark source code?

The public API is declared in [`sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala) (lines 1890–1905), while the concrete implementation resides in [`sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala) (lines 1171–1180). The method constructs a logical `Union` node with `byName = true`, which the analyzer resolves by name before the query is planned.