Apache Hive vs Spark: Execution Model Differences Developers Must Know

Apache Hive compiles queries into eager MapReduce jobs that materialize intermediate results to HDFS, while Apache Spark uses a lazy DAG execution model that pipelines operators in memory and only triggers computation when actions are called.

When evaluating Apache Hive vs Spark for big data processing, understanding their execution architectures is critical for performance tuning. Both engines support SQL-like querying over distributed datasets, but the source code in the apache/spark repository shows a fundamentally different approach to query execution from Hive's traditional MapReduce foundation.

Execution Philosophy: Eager Materialization vs Lazy Evaluation

The core distinction between these engines lies in when and how they execute query plans.

Apache Hive follows an eager execution model. When a HiveQL statement is submitted, the engine immediately parses the query, builds a query block, and translates it into one or more MapReduce (or Tez) jobs. Each job materializes its intermediate output to HDFS before the next job begins, creating a rigid compilation-execution boundary that is visible to the user.

Apache Spark employs a lazy evaluation model. The engine parses SQL or DataFrame operations and constructs a Catalyst logical plan, but no computation occurs until an action (such as collect(), show(), or write) is invoked. This allows Spark to optimize the entire workflow before execution begins.
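The eager versus lazy distinction can be illustrated without a cluster. The following is a plain-Scala analogy, not Spark or Hive code: a strict collection stands in for eager materialization, and a collection view stands in for a lazily built plan. The object and field names are hypothetical and exist only for this illustration.

```scala
// Plain-Scala analogy (no Spark required): strict collections vs. lazy views.
object LazyVsEager {
  var eagerCalls = 0
  var lazyCalls = 0

  // Eager: map runs immediately and materializes a full intermediate Vector.
  val eagerResult: Vector[Int] =
    Vector(1, 2, 3, 4).map { x => eagerCalls += 1; x * 2 }

  // Lazy: the view only records the transformation; nothing runs yet.
  val lazyPlan = Vector(1, 2, 3, 4).view.map { x => lazyCalls += 1; x * 2 }

  // Snapshot taken before any "action": the mapping function has not run.
  val callsBeforeAction: Int = lazyCalls

  // "Action": forcing the view finally runs the recorded transformation.
  val lazyResult: Vector[Int] = lazyPlan.toVector
}
```

Here callsBeforeAction stays at zero because building the view does no work, which mirrors how Spark transformations only extend a plan until an action forces evaluation.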

How Spark's Lazy DAG Execution Works

Catalyst Optimizer and Logical Planning

Spark SQL queries are parsed into a logical plan, which the Catalyst optimizer then transforms through rule-based and cost-based optimizations. The SparkPlanner class in sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala collects execution strategies and produces a physical plan that remains unevaluated until triggered by an action.
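To make the idea of a rule-based plan rewrite concrete, here is a toy sketch in plain Scala. It is not Catalyst code: the Plan, Scan, Filter, and Project types and the combineFilters function are hypothetical, and conditions are kept as strings for brevity. The rule merges adjacent filters into one, which is the kind of tree-to-tree rewrite an optimizer applies before anything executes.

```scala
// Toy plan tree and one rewrite rule; all names here are illustrative only.
object MiniOptimizer {
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan

  // Rule: collapse two adjacent Filter nodes into a single conjunctive Filter.
  def combineFilters(plan: Plan): Plan = plan match {
    case Filter(outer, Filter(inner, child)) =>
      // Re-run on the merged node so chains of three or more filters collapse too.
      combineFilters(Filter(s"($inner) AND ($outer)", child))
    case Filter(condition, child) => Filter(condition, combineFilters(child))
    case Project(columns, child)  => Project(columns, combineFilters(child))
    case scan: Scan               => scan
  }
}
```

The rewrite only manipulates the tree; no table is read at any point, which is why such optimizations are cheap to apply ahead of execution.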

Transformations such as filter, select, and join simply modify the logical plan tree without executing code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("Apache Hive vs Spark Demo")
  .enableHiveSupport()
  .getOrCreate()

// Enables the $"column" syntax used below
import spark.implicits._

// Transformations are lazy - only building a logical plan
val df = spark.read.table("sales")
  .filter($"region" === "US")
  .groupBy($"product")
  .agg(sum($"amount").as("total"))

// No computation yet - this only prints the logical and physical plans
df.explain(true)

DAGScheduler and Physical Execution

When an action is called, Spark's DAGScheduler (implemented in core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala) breaks the job into stages at shuffle boundaries. Each stage is a set of tasks whose operators can be pipelined without shuffling. The scheduler submits these stages to the TaskScheduler, which manages execution across the cluster.
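The stage-splitting idea can be sketched in a few lines of plain Scala. This is a simplified model and not how DAGScheduler is implemented: it treats a job as a linear list of operators, and the Op type, its requiresShuffle flag, and splitIntoStages are hypothetical names. An operator that needs shuffled input starts a new stage; everything else joins the current one.

```scala
// Simplified model of cutting a linear operator chain at shuffle boundaries.
object StageSplitSketch {
  // requiresShuffle = true means the operator needs its input repartitioned first.
  final case class Op(name: String, requiresShuffle: Boolean)

  def splitIntoStages(ops: Seq[Op]): Seq[Seq[String]] =
    ops.foldLeft(Vector(Vector.empty[String])) { (stages, op) =>
      if (op.requiresShuffle) stages :+ Vector(op.name)   // boundary: open a new stage
      else stages.init :+ (stages.last :+ op.name)        // pipeline into current stage
    }.filter(_.nonEmpty)
}
```

For a chain of scan, filter, aggregate (shuffle), and project, this model yields two stages: scan and filter run together, then aggregate and project run together after the shuffle.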

Key methods in DAGScheduler.scala include:

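  • submitStage() - Submits a stage, first recursively submitting any parent stages whose output is missing
  • resubmitFailedStages() - Resubmits stages that failed (for example, after a shuffle fetch failure), providing fault tolerance without HDFS materialization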

Unlike Hive's MapReduce model, Spark stages pipeline multiple operators (filter → project → aggregation) into single tasks, keeping intermediate data in memory or spilling to disk without writing to HDFS.
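A rough picture of what operator pipelining inside one task means, written as plain Scala rather than Spark internals: each record flows through filter, projection, and partial aggregation in a single pass over an iterator, so no intermediate collection is ever built. The Sale type and runTask function are hypothetical names chosen for this sketch.

```scala
// One "task" over one partition: operators are fused into a single pass.
object PipelinedTask {
  final case class Sale(region: String, product: String, amount: Double)

  def runTask(partition: Iterator[Sale]): Map[String, Double] =
    partition
      .filter(_.region == "US")                 // filter
      .map(s => (s.product, s.amount))          // project
      .foldLeft(Map.empty[String, Double]) {    // partial aggregation per product
        case (acc, (product, amount)) =>
          acc.updated(product, acc.getOrElse(product, 0.0) + amount)
      }
}
```

Because Iterator operations are lazy, filter and map do not produce intermediate lists; each record is pulled through all three steps before the next one is read.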

Hive's Execution Model: Query Blocks to MapReduce Jobs

Apache Hive translates HiveQL statements into query blocks that map to MapReduce or Tez jobs. This process is eager: once the query is parsed, the engine immediately launches the first job.

Each job follows a rigid pattern:

  1. Map phase reads input data and applies filters/projections
  2. Shuffle sorts and transfers data to reducers
  3. Reduce phase performs aggregations or joins
  4. Output is materialized to HDFS

This materialization boundary means Hive cannot pipeline operations across job boundaries. If a query requires three MapReduce jobs, intermediate data is written to HDFS and read back twice, once at each of the two job boundaries, creating significant I/O overhead.
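The cost of job boundaries can be mimicked locally. In the sketch below, temporary files on the local filesystem stand in for HDFS, and each call to the hypothetical runJob helper plays the role of one job that reads its input in full and writes its output in full. None of this is Hive code; it only models the write-then-read pattern between chained jobs.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

// Local temp files stand in for HDFS; every name here is illustrative.
object MaterializedJobs {
  var intermediateWrites = 0

  // One "job": read all input lines, transform them, write a new output file.
  def runJob(input: Path, transform: Seq[String] => Seq[String], isFinal: Boolean): Path = {
    val lines = new String(Files.readAllBytes(input), UTF_8).split("\n").toSeq.filter(_.nonEmpty)
    val output = Files.createTempFile("job-output-", ".txt")
    Files.write(output, transform(lines).mkString("\n").getBytes(UTF_8))
    if (!isFinal) intermediateWrites += 1   // output exists only to feed the next job
    output
  }

  // Three chained jobs: two intermediate files are written and read back.
  def runThreeJobs(source: Path): Path = {
    val afterJob1 = runJob(source, _.filter(_.startsWith("US,")), isFinal = false)
    val afterJob2 = runJob(afterJob1, _.map(_.toUpperCase), isFinal = false)
    runJob(afterJob2, _.sorted, isFinal = true)
  }
}
```

With three jobs the counter ends at two, matching the two boundaries at which a multi-job query has to persist and reload its intermediate data.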

Practical Comparison: Code Examples

Spark DataFrame API: Lazy Execution

The following example demonstrates Spark's lazy evaluation through the explain() and show() methods:

// Assumes the `spark` session and `import spark.implicits._` from the earlier example

// Lazy transformations build a plan
val filteredDF = spark.table("transactions")
  .filter($"date" >= "2024-01-01")
  .join(spark.table("customers"), "customer_id")
  .select("name", "amount")

// View the optimized physical plan without executing
filteredDF.explain(true)

// Action triggers DAGScheduler execution
filteredDF.write.parquet("output/")

The explain(true) call prints the parsed, analyzed, and optimized logical plans along with the physical plan produced by SparkPlanner.scala, all before any data is processed.

Spark SQL with Hive Support

Spark can query Hive tables while maintaining its lazy DAG execution:

// Using Hive metastore but Spark's execution engine
spark.sql("""
  SELECT category, COUNT(*) as cnt
  FROM hive_table
  WHERE year = 2024
  GROUP BY category
""").show()

The SQL string is parsed and optimized by Catalyst, the table read is planned by HiveStrategies as a HiveTableScanExec, and the query is executed as a Spark DAG rather than as separate MapReduce jobs.

Classic Hive on Hadoop (Eager Execution)

# Hive CLI submits eager MapReduce jobs
hive -e "
  SELECT department, AVG(salary) as avg_sal
  FROM employees
  WHERE hire_date > '2020-01-01'
  GROUP BY department;
"

The Hive CLI compiles the query into one or more MapReduce jobs. Each job writes intermediate output to HDFS before the next begins, reflecting Hive's eager execution model.

Demonstrating Spark's Fault Tolerance

Spark's DAGScheduler handles failures without HDFS materialization:

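// Force a deterministic task failure: every retry of the failing task fails again,
// so collect() eventually throws a SparkException and the job is aborted
spark.sparkContext.parallelize(1 to 100, 4).map { i =>
  require(i != 42, "Simulated failure")  // fails in the partition containing 42
  i * 2
}.collect()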

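When a task fails, Spark's task scheduler first retries that task, up to spark.task.maxFailures attempts (4 by default on a cluster). Because the failure above is deterministic, every attempt fails and the job is aborted. DAGScheduler.resubmitFailedStages (see DAGScheduler.scala) covers a different case: when a stage fails because shuffle output from an earlier stage is lost, it resubmits the affected stages and recomputes the missing data from lineage, without restarting the whole application or reading intermediate results from HDFS.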
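The difference between a transient and a deterministic task failure can be sketched with a small retry loop in plain Scala. The runWithRetries helper is hypothetical and is not a Spark API; its maxFailures parameter only loosely mirrors the role of the spark.task.maxFailures setting.

```scala
import scala.util.{Failure, Try}

object TaskRetrySketch {
  // Hypothetical helper: run the task body, retrying until it succeeds
  // or maxFailures attempts have failed. The attempt number is passed in.
  def runWithRetries[A](maxFailures: Int)(task: Int => A): Try[A] = {
    def attempt(n: Int): Try[A] = Try(task(n)) match {
      case Failure(_) if n < maxFailures => attempt(n + 1)  // retry
      case other                         => other           // success or final failure
    }
    attempt(1)
  }
}
```

A body that fails only on its first attempt recovers on the second, whereas a body that always fails (like the require on 42 above) exhausts every attempt and returns a failure, which is the situation in which a real job would be aborted.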

Key Source Files in Apache Spark

Understanding the execution model requires examining these specific files in the apache/spark repository:

Component | Source File | Key Implementation Details
DAG Scheduling | core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala | submitStage, resubmitFailedStages, stage-task mapping, and fault tolerance logic
Physical Planning | sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala | Strategy collection, physical plan creation from logical plans
Hive Strategy Registration | sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala | HiveTableScans object that bridges Hive relations to Spark scans
Physical Scan Implementation | sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala | Column and partition pruning, predicate binding, HadoopTableReader integration
Session Integration | sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala | Integration of Hive metastore with Spark's SessionState while maintaining DAG execution

These files demonstrate how Spark maintains its lazy DAG execution model even when processing Hive-compatible queries, fundamentally differing from Hive's eager MapReduce approach.

Summary

  • Apache Hive uses an eager execution model that compiles HiveQL into MapReduce or Tez jobs, materializing intermediate results to HDFS between stages.
  • Apache Spark employs a lazy DAG execution model where transformations build a logical plan via the Catalyst optimizer, and the DAGScheduler pipelines operations into stages only when actions trigger execution.
  • Spark's in-memory pipelining avoids HDFS materialization between stages, while Hive's disk-based materialization provides durability at the cost of latency.
  • Spark integrates with Hive through HiveStrategies and HiveSessionStateBuilder, allowing Hive metastore access while maintaining Spark's lazy execution semantics.
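  • Fault tolerance differs significantly: Hive can restart from intermediate outputs persisted in replicated HDFS storage, while Spark retries failed tasks and uses DAGScheduler.resubmitFailedStages to resubmit stages whose shuffle output was lost, recomputing data from lineage without HDFS overhead.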

Frequently Asked Questions

Can Apache Spark run existing Hive queries without modification?

In most cases, yes. Spark provides Hive compatibility through the HiveSessionStateBuilder in sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala, which connects Spark sessions to the Hive metastore, and Spark SQL accepts most HiveQL syntax. However, the execution follows Spark's lazy DAG model via HiveStrategies rather than Hive's MapReduce engine, often resulting in better performance for the same queries. Queries that depend on Hive features Spark does not support may still need changes.

Which engine provides better fault tolerance for long-running ETL jobs?

The two engines handle failures differently. Hive writes intermediate results to HDFS after each MapReduce job, allowing recovery from any completed stage but incurring high I/O overhead. Spark keeps intermediate data in memory (or spills to local disk), retries individual failed tasks, and relies on DAGScheduler.resubmitFailedStages to recompute stages whose output was lost. For long-running pipelines, Spark's approach is generally faster, though Hive's HDFS materialization provides stronger durability guarantees between major stages.

How does partition pruning differ between Hive and Spark?

Both engines support partition pruning, but the implementation differs. Hive prunes partitions at compile time in its own optimizer, so only the matching partition directories are passed as input to the generated jobs. Spark implements this through HiveTableScanExec in sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala, where partition predicates are extracted during the Catalyst optimization phase and applied during the physical scan, allowing the lazy DAG to skip irrelevant data before execution begins.
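In both engines the underlying idea is the same: evaluate the partition predicate against partition metadata so that non-matching directories are never opened. A minimal plain-Scala sketch of that idea follows; the Partition type, the example paths, and the prune function are hypothetical and do not correspond to classes in either project.

```scala
object PartitionPruningSketch {
  // Partition metadata as a metastore might describe it (illustrative only).
  final case class Partition(year: Int, path: String)

  // Keep only partitions whose metadata satisfies the predicate;
  // the data files under the dropped paths are never read.
  def prune(partitions: Seq[Partition], predicate: Partition => Boolean): Seq[Partition] =
    partitions.filter(predicate)

  val allPartitions: Seq[Partition] = Seq(
    Partition(2022, "/warehouse/hive_table/year=2022"),
    Partition(2023, "/warehouse/hive_table/year=2023"),
    Partition(2024, "/warehouse/hive_table/year=2024")
  )

  // Equivalent in spirit to WHERE year = 2024
  val toScan: Seq[Partition] = prune(allPartitions, _.year == 2024)
}
```

Only the year=2024 path survives pruning, so a scan built from toScan would touch one directory instead of three.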

When should developers choose Hive over Spark for data processing?

Choose Hive when you need strict SQL compliance with ACID transactions (Hive 3.0+), when your infrastructure requires HDFS materialization for auditability, or when running simple batch ETL where startup overhead dominates execution time. Choose Spark for iterative algorithms, interactive analytics, machine learning pipelines, or when you need to minimize latency through in-memory caching and pipelined DAG execution.
