# Apache Hive vs Spark: Execution Model Differences Developers Must Know

> Understand Apache Hive vs Spark execution models. Hive uses eager MapReduce, Spark uses lazy DAG execution for faster data processing. Learn the key differences.

- Repository: [The Apache Software Foundation/spark](https://github.com/apache/spark)
- Tags: deep-dive
- Published: 2026-02-16

---

**Apache Hive compiles queries into eager MapReduce jobs that materialize intermediate results to HDFS, while Apache Spark uses a lazy DAG execution model that pipelines operators in memory and only triggers computation when actions are called.**

When evaluating **Apache Hive vs Spark** for big data processing, understanding their execution architectures is critical for performance tuning. Both engines support SQL-like querying over distributed datasets, but the `apache/spark` repository reveals a fundamentally different approach to query execution from Hive's traditional MapReduce foundation.

## Execution Philosophy: Eager Materialization vs Lazy Evaluation

The core distinction between these engines lies in when and how they execute query plans.

**Apache Hive** follows an **eager execution model**. When a HiveQL statement is submitted, the engine immediately parses the query, builds a query block, and translates it into one or more MapReduce (or Tez) jobs. Each job materializes its intermediate output to HDFS before the next job begins, creating a rigid compilation-execution boundary that is visible to the user.

**Apache Spark** employs a **lazy evaluation model**. The engine parses SQL or DataFrame operations and constructs a **Catalyst logical plan**, but no computation occurs until an **action** (such as `collect()`, `show()`, or a write like `write.parquet(...)`) is invoked. This allows Spark to optimize the entire workflow before execution begins.
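To make the contrast concrete, here is a minimal sketch in plain Scala collections rather than Hive or Spark themselves. A strict `List` stands in for eager materialization and a `view` stands in for a lazy plan; the counters and variable names are purely illustrative.

```scala
// Toy model only: no Spark or Hive APIs are involved.
val input = (1 to 10).toList
var eagerCalls = 0
var lazyCalls = 0

// Eager: each step runs immediately and materializes a full intermediate List.
val eagerFiltered = input.filter { x => eagerCalls += 1; x % 2 == 0 }
val eagerMapped = eagerFiltered.map { x => eagerCalls += 1; x * 10 }
println(s"eager calls after defining the pipeline: $eagerCalls")

// Lazy: the view only records the steps; nothing has run yet.
val lazyPipeline = input.view
  .filter { x => lazyCalls += 1; x % 2 == 0 }
  .map { x => lazyCalls += 1; x * 10 }
val lazyCallsBeforeForce = lazyCalls
println(s"lazy calls after defining the pipeline: $lazyCallsBeforeForce")

// Forcing the view plays the role of a Spark action: filter and map are
// applied element by element in a single pass, with no intermediate List.
val lazyResult = lazyPipeline.toList
println(s"lazy calls after forcing: $lazyCalls")
println(s"same result: ${lazyResult == eagerMapped}")
```

Both pipelines produce the same values; the difference is when the work happens and whether an intermediate collection is built between steps.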

## How Spark's Lazy DAG Execution Works

### Catalyst Optimizer and Logical Planning

Spark SQL queries enter through the **Catalyst optimizer**, which transforms the logical plan through rule-based and cost-based optimizations. The `SparkPlanner` class in [`sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala) collects execution strategies and produces a physical plan that remains unevaluated until triggered by an action.

Transformations such as `filter`, `select`, and `join` simply modify the logical plan tree without executing code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("Apache Hive vs Spark Demo")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._  // enables the $"column" syntax

// Transformations are lazy: they only build a logical plan
val df = spark.read.table("sales")
  .filter($"region" === "US")
  .groupBy($"product")
  .agg(sum($"amount").as("total"))

// No job runs here; explain(true) only prints the logical and physical plans
df.explain(true)
```
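The individual planning phases can also be inspected programmatically through `Dataset.queryExecution`. The sketch below is written for `spark-shell` or a Scala script and makes two assumptions: it runs against a local master, and it uses a small in-memory DataFrame as a hypothetical stand-in for the `sales` table so that no metastore is needed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .master("local[*]")        // assumption: local run for illustration
  .appName("plan-phases")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the `sales` table
val sales = Seq(("US", "widget", 10.0), ("US", "gadget", 5.0), ("EU", "widget", 7.0))
  .toDF("region", "product", "amount")

val df = sales
  .filter($"region" === "US")
  .groupBy($"product")
  .agg(sum($"amount").as("total"))

val qe = df.queryExecution
println(qe.logical)        // plan as written, before analysis
println(qe.analyzed)       // after resolving columns and types
println(qe.optimizedPlan)  // after Catalyst optimizer rules
println(qe.executedPlan)   // physical plan selected by the planner

// No Spark job has been submitted so far; the action below submits one
df.show()
```

Printing the plans side by side makes it easier to see which optimizer rules changed the query before anything executes.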

### DAGScheduler and Physical Execution

When an action is called, Spark's **DAGScheduler** (implemented in [`core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala)) breaks the job's RDD lineage, which is generated from the physical plan, into **stages**: sets of tasks that can be pipelined without a shuffle. The scheduler submits these stages to the TaskScheduler, which manages execution across the cluster.

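Key methods in [`DAGScheduler.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala) include:
- `submitStage()` - Submits a stage, first recursively submitting any missing parent stages
- `resubmitFailedStages()` - Resubmits failed stages (for example, after a shuffle fetch failure), recomputing lost output without HDFS materialization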

Unlike Hive's MapReduce model, Spark stages pipeline multiple operators (filter → project → aggregation) into single tasks, keeping intermediate data in memory or spilling to disk without writing to HDFS.
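Stage boundaries can be observed directly on an RDD's lineage. The following is a rough sketch, assuming a local master and made-up data: narrow transformations such as `filter` and `map` carry a `OneToOneDependency` and are pipelined into one stage, while `reduceByKey` introduces a `ShuffleDependency`, which is where the scheduler cuts a new stage.

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")        // assumption: local run for illustration
  .appName("stage-boundaries")
  .getOrCreate()
val sc = spark.sparkContext

// Narrow transformations: each output partition depends on one input partition
val pipelined = sc.parallelize(1 to 100, 4)
  .filter(_ % 2 == 0)
  .map(i => (i % 10, i))

// reduceByKey needs all values of a key together, which forces a shuffle
val aggregated = pipelined.reduceByKey(_ + _)

val narrow = pipelined.dependencies.head.isInstanceOf[OneToOneDependency[_]]
val wide = aggregated.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
println(s"map is narrow: $narrow, reduceByKey shuffles: $wide")

// The indentation step in the printed lineage marks the shuffle boundary
println(aggregated.toDebugString)

// The action: two stages run, separated by the shuffle
val totals = aggregated.collect()
```

The same split is visible in the Spark UI, where the job for `collect()` is listed with one stage on each side of the shuffle.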

## Hive's Execution Model: Query Blocks to MapReduce Jobs

Apache Hive translates HiveQL statements into **query blocks** that map to MapReduce or Tez jobs. This process is **eager**: once the query is compiled, the engine immediately launches the first job.

Each job follows a rigid pattern:
1. **Map phase** reads input data and applies filters/projections
2. **Shuffle** sorts and transfers data to reducers
3. **Reduce phase** performs aggregations or joins
4. **Output** is materialized to HDFS

This materialization boundary means Hive cannot pipeline operations across job boundaries. If a query requires three MapReduce jobs, data is written to and read from HDFS twice between stages, creating significant I/O overhead.
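The four-phase pattern above can be modeled in a few lines of plain Scala. This is only a toy sketch: it uses in-memory collections instead of Hadoop or Hive APIs, and the `Record` type, the sample rows, and the `intermediateRoundTrips` helper are hypothetical names introduced for illustration.

```scala
// Toy model of one MapReduce job; no Hadoop or Hive APIs are used.
final case class Record(department: String, salary: Double, year: Int)

val input = List(
  Record("eng", 100.0, 2021), Record("eng", 120.0, 2019),
  Record("ops", 80.0, 2022), Record("ops", 90.0, 2023)
)

// 1. Map phase: filter and project to (key, value) pairs
val mapped: List[(String, Double)] =
  input.filter(_.year > 2020).map(r => (r.department, r.salary))

// 2. Shuffle: bring all values for a key together, sorted by key
val shuffled: List[(String, List[Double])] =
  mapped.groupBy(_._1).toList.sortBy(_._1)
    .map { case (key, pairs) => (key, pairs.map(_._2)) }

// 3. Reduce phase: aggregate per key (here, an average)
val reduced: List[(String, Double)] =
  shuffled.map { case (key, values) => (key, values.sum / values.size) }

// 4. Output: a real job would write this result to HDFS; here it is simply
//    a fully materialized List that a following job would have to re-read.
reduced.foreach { case (dept, avg) => println(s"$dept\t$avg") }

// A chain of n jobs has n - 1 intermediate write/read round trips.
def intermediateRoundTrips(jobs: Int): Int = math.max(jobs - 1, 0)
println(intermediateRoundTrips(3))
```

Every arrow between phases in this model is a fully built collection, which mirrors why a multi-job Hive query pays for a write and a read at each job boundary.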

## Practical Comparison: Code Examples

### Spark DataFrame API: Lazy Execution

The following example demonstrates Spark's lazy evaluation through the `explain()` and `show()` methods:

```scala
// Lazy transformations build a plan
val filteredDF = spark.table("transactions")
  .filter($"date" >= "2024-01-01")
  .join(spark.table("customers"), "customer_id")
  .select("name", "amount")

// View the optimized physical plan without executing
filteredDF.explain(true)

// Action triggers DAGScheduler execution
filteredDF.write.parquet("output/")
```

*The `explain(true)` call prints the parsed, analyzed, and optimized logical plans along with the physical plan chosen by [`SparkPlanner.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala), all before any data is processed.*

### Spark SQL with Hive Support

Spark can query Hive tables while maintaining its lazy DAG execution:

```scala
// Using Hive metastore but Spark's execution engine
spark.sql("""
  SELECT category, COUNT(*) as cnt
  FROM hive_table
  WHERE year = 2024
  GROUP BY category
""").show()
```

*The SQL string is parsed, planned by `HiveStrategies` into a `HiveTableScanExec` when `hive_table` is read through its Hive SerDe, and executed as a Spark DAG rather than as separate MapReduce jobs.*

### Classic Hive on Hadoop (Eager Execution)

```bash
# Hive CLI submits eager MapReduce jobs
hive -e "
  SELECT department, AVG(salary) as avg_sal
  FROM employees
  WHERE hire_date > '2020-01-01'
  GROUP BY department;
"
```

*The Hive CLI compiles the query into one or more MapReduce jobs. Each job writes intermediate output to HDFS before the next begins, reflecting Hive's eager execution model.*

### Demonstrating Spark's Fault Tolerance

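Spark recovers from task and stage failures without materializing intermediate results to HDFS:

```scala
// Deterministic failure: the task that processes 42 fails on every attempt.
// Spark retries it up to spark.task.maxFailures times (no retries in plain
// local mode), then aborts the job with a SparkException.
spark.sparkContext.parallelize(1 to 100, 4).map { i =>
  require(i != 42, "Simulated failure")  // fails in the partition containing 42
  i * 2
}.collect()
```

*A failed task is retried by the task scheduler. When shuffle output is lost, `DAGScheduler.resubmitFailedStages` (see [`DAGScheduler.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala)) resubmits the affected stages. Neither path restarts the whole application or reads intermediate results from HDFS.*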
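A failure that goes away on retry shows recovery more clearly than one that always fails. The sketch below is a hedged illustration, not a test of Spark internals: it assumes a fresh session started with the `local[2,4]` master (two threads, up to four attempts per task) and uses `TaskContext.attemptNumber()` to fail only the first attempt. If a session with a different master already exists, `getOrCreate()` returns that one and the retry limit may differ.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2,4]")      // assumption: 2 threads, up to 4 attempts per task
  .appName("retry-demo")
  .getOrCreate()

val result = spark.sparkContext.parallelize(1 to 100, 4).map { i =>
  // Fail only on the first attempt of the task that processes 42
  if (i == 42 && TaskContext.get().attemptNumber() == 0) {
    throw new RuntimeException("Simulated transient failure")
  }
  i * 2
}.collect()

// The job completes: only the failed task was re-run, and nothing was
// read back from HDFS to recover.
println(result.sum)
```

The driver log should show the lost task followed by a second attempt of the same task, while the other three partitions are not recomputed.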

## Key Source Files in Apache Spark

Understanding the execution model requires examining these specific files in the `apache/spark` repository:

| Component | Source File | Key Implementation Details |
|-----------|-------------|---------------------------|
| **DAG Scheduling** | [`core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala) | `submitStage`, `resubmitFailedStages`, stage-task mapping, and fault tolerance logic |
| **Physical Planning** | [`sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala) | Strategy collection, physical plan creation from logical plans |
| **Hive Strategy Registration** | [`sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala`](https://github.com/apache/spark/blob/main/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala) | `HiveTableScans` object that bridges Hive relations to Spark scans |
| **Physical Scan Implementation** | [`sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala`](https://github.com/apache/spark/blob/main/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala) | Column and partition pruning, predicate binding, HadoopTableReader integration |
| **Session Integration** | [`sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala`](https://github.com/apache/spark/blob/main/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala) | Integration of Hive metastore with Spark's SessionState while maintaining DAG execution |

These files demonstrate how Spark maintains its lazy DAG execution model even when processing Hive-compatible queries, fundamentally differing from Hive's eager MapReduce approach.

## Summary

- **Apache Hive** uses an **eager execution model** that compiles HiveQL into MapReduce or Tez jobs, materializing intermediate results to HDFS between stages.
- **Apache Spark** employs a **lazy DAG execution model** where transformations build a logical plan via the Catalyst optimizer, and the `DAGScheduler` pipelines operations into stages only when actions trigger execution.
- Spark's **in-memory pipelining** avoids HDFS materialization between stages, while Hive's **disk-based materialization** provides durability at the cost of latency.
- Spark integrates with Hive through `HiveStrategies` and `HiveSessionStateBuilder`, allowing Hive metastore access while maintaining Spark's lazy execution semantics.
- Fault tolerance differs significantly: Hive relies on HDFS replication of intermediate outputs, while Spark retries failed tasks and uses `DAGScheduler.resubmitFailedStages` to recompute stages whose shuffle output was lost, without HDFS overhead.

## Frequently Asked Questions

### Can Apache Spark run existing Hive queries without modification?

In most cases, yes. Spark provides Hive compatibility through the `HiveSessionStateBuilder` in [`sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala`](https://github.com/apache/spark/blob/main/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala), which wires the Hive metastore into the Spark session so that most HiveQL queries run unchanged; a small set of Hive features is not supported. Execution follows Spark's lazy DAG model via `HiveStrategies` rather than Hive's MapReduce engine, often resulting in better performance for the same queries.

### Which engine provides better fault tolerance for long-running ETL jobs?

Both engines handle failures differently. Hive writes intermediate results to HDFS after each MapReduce job, allowing recovery from any completed stage but incurring high I/O overhead. Spark keeps intermediate data in memory (or spills to local disk), retries failed tasks individually, and relies on `DAGScheduler.resubmitFailedStages` to recompute stages whose shuffle output was lost. For long-running pipelines, Spark's approach is generally faster, though Hive's HDFS materialization provides stronger durability guarantees between major stages.

### How does partition pruning differ between Hive and Spark?

Both engines support partition pruning, but the implementation differs. Hive prunes partitions at compile time in its own optimizer, so only the matching partition directories become inputs to the generated jobs. Spark implements this for Hive tables through `HiveTableScanExec` in [`sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala`](https://github.com/apache/spark/blob/main/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala), where partition predicates are extracted during the Catalyst optimization phase and applied during the physical scan, allowing the lazy DAG to skip irrelevant data before execution begins.
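Pruning is easy to observe in a physical plan. The sketch below deliberately uses Spark's native Parquet source instead of a Hive SerDe table so that it runs without a metastore, which means the scan node is a file scan rather than `HiveTableScanExec`. The sample rows, temporary path, and local master are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")        // assumption: local run for illustration
  .appName("partition-pruning")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data written as a Parquet dataset partitioned by `year`
val path = Files.createTempDirectory("events").resolve("data").toString
Seq(("a", 2023), ("b", 2024), ("c", 2024))
  .toDF("category", "year")
  .write.partitionBy("year").parquet(path)

val pruned = spark.read.parquet(path).filter($"year" === 2024)

// The file scan node lists the partition predicate under PartitionFilters,
// separate from ordinary data filters, so year=2023 is never read.
pruned.explain()

val matching = pruned.count()
println(matching)
```

Because the predicate is on the partition column, the decision about which directories to read is made from the plan alone, before any task touches the data files.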

### When should developers choose Hive over Spark for data processing?

Choose Hive when you need strict SQL compliance with ACID transactions (Hive 3.0+), when your infrastructure requires HDFS materialization for auditability, or when running simple batch ETL where startup overhead dominates execution time. Choose Spark for iterative algorithms, interactive analytics, machine learning pipelines, or when you need to minimize latency through in-memory caching and pipelined DAG execution.