Where to Find an Exhaustive List of Actions in Apache Spark: Complete API Reference

February 18, 2026 apache/spark ↗

All built-in Spark actions are defined as public methods in the core API source files, primarily RDD.scala for RDD operations and Dataset.scala for DataFrame/Dataset operations, with additional actions located in DataFrameWriter.scala, DStream.scala, and streaming query classes.

Apache Spark uses lazy evaluation: transformations only build a logical execution plan, while actions trigger the actual computation and either return results to the driver or write them to external storage. No single document enumerates every action, so an exhaustive list has to come from the source code of the apache/spark repository, where each high-level API exposes its actions in specific Scala trait and class definitions.
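The difference between building a plan and running it can be observed directly. The following is a minimal sketch, assuming a throwaway local SparkSession; the local[2] master, the app name, and the accumulator name are illustrative choices. It counts how often the function passed to map() has run before and after an action is called.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[2]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

// Accumulator that records how many times the map function actually runs.
val mapCalls = sc.longAccumulator("mapCalls")

// Transformation: only records lineage, nothing executes yet.
val doubled = sc.parallelize(1 to 100).map { x => mapCalls.add(1); x * 2 }
val callsBeforeAction: Long = mapCalls.value

// Action: schedules a job, which finally runs the map function.
val n: Long = doubled.count()
val callsAfterAction: Long = mapCalls.value

println(s"before=$callsBeforeAction after=$callsAfterAction count=$n")
```

Accumulator updates made inside a transformation can be applied more than once if a task is retried, so this is a demonstration device rather than a reliable counter.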

What Are Spark Actions?

Actions are operations that execute the underlying computation graph and materialize results. Unlike transformations, which return new RDDs or Datasets, actions either return values to the driver program or write data to external storage systems. Triggering an action forces Spark to schedule and execute tasks across the cluster.

Core API Source Files Containing All Spark Actions

The exhaustive list of actions is distributed across the main API modules. Each file below contains the complete public interface for actionable methods in its respective domain.

RDD Actions

The primary source for RDD actions is core/src/main/scala/org/apache/spark/rdd/RDD.scala. Actions defined in this file include:

collect(), count(), first(), take(), takeOrdered(), top()
reduce(), fold(), aggregate()
foreach(), foreachPartition()
saveAsTextFile(), saveAsObjectFile()
countByValue(), countApprox(), countApproxDistinct()

Actions that apply only to key-value RDDs are defined in neighboring files in the same directory: saveAsHadoopFile(), saveAsNewAPIHadoopFile(), countByKey(), and collectAsMap() are in PairRDDFunctions.scala, and saveAsSequenceFile() is in SequenceFileRDDFunctions.scala.
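A few of the RDD actions listed above are less self-explanatory than count() or collect(). The sketch below, which assumes a local SparkSession and made-up sample data, shows what takeOrdered(), aggregate(), and countByValue() hand back to the driver.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[2]").appName("rdd-actions-demo").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(Seq(5, 3, 8, 1))

// takeOrdered returns the n smallest elements, sorted ascending, to the driver.
val smallestTwo: Array[Int] = nums.takeOrdered(2)

// aggregate builds a (sum, count) pair per partition, then merges the pairs.
val (sum, cnt) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)

// countByValue returns a local Map from each distinct element to its frequency.
val freq = sc.parallelize(Seq("a", "b", "a", "c", "a")).countByValue()
```

All three results are ordinary driver-side Scala values (an Array, a tuple, and a Map), which is the usual sign that the method is an action.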

Dataset and DataFrame Actions

For the high-level Dataset API (which includes DataFrames), examine sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala. Key actions include:

show(), collect(), collectAsList()
count(), first(), head(), take(), takeAsList()
reduce(), isEmpty(), toLocalIterator()
foreach(), foreachPartition()
write (not an action itself; it returns a DataFrameWriter, whose methods perform the actual write)
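To make the Dataset list concrete, here is a short sketch under the assumption of a local SparkSession and a three-element sample Dataset. It exercises reduce() and the java.util.List variants, and shows that obtaining a writer runs nothing by itself.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[2]").appName("dataset-actions-demo").getOrCreate()
import spark.implicits._

val ds = Seq(3, 1, 2).toDS()

// reduce combines all elements with a binary function and returns the result to the driver.
val total: Int = ds.reduce((a, b) => a + b)

// collectAsList and takeAsList return java.util.List, which is convenient for Java callers.
val all: java.util.List[Int] = ds.collectAsList()
val firstTwo: java.util.List[Int] = ds.takeAsList(2)

// write only hands back a DataFrameWriter; no job runs until a method such as save() is called on it.
val writer = ds.write
```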

DataFrameWriter Actions

Write-specific actions for saving data are defined in sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala (and the corresponding interface in sql/api). These include:

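save(), saveAsTable(), insertInto()
parquet(), orc(), json(), csv(), text(), jdbc() (format-specific shortcuts that also perform the write)
mode(), format(), partitionBy(), bucketBy(), options() (configuration methods that precede the final write; they are not actions themselves)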
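The split between configuration methods and write actions can be checked against the file system. This is a sketch only: it assumes a local SparkSession, and the temp-directory output path and column names are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[2]").appName("writer-demo").getOrCreate()
import spark.implicits._

// Hypothetical output location: a not-yet-existing child of a fresh temp directory.
val outDir = Files.createTempDirectory("spark-writer-demo").resolve("out")

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Configuration only: these calls return the writer and touch no storage.
val writer = df.write.mode("overwrite").format("parquet")
val existsBeforeSave = Files.exists(outDir)

// save() is the action that runs the job and writes the files.
writer.save(outDir.toString)
val existsAfterSave = Files.exists(outDir)

// Reading the data back confirms that both rows were written.
val rowsReadBack = spark.read.parquet(outDir.toString).count()
```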

DStream Actions (Spark Streaming)

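For the legacy DStream API, the equivalents of actions are called output operations. Most are defined in streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala:

print(), foreachRDD()
saveAsObjectFiles(), saveAsTextFiles()

The key-value variants saveAsHadoopFiles() and saveAsNewAPIHadoopFiles() are defined in PairDStreamFunctions.scala in the same directory.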

Structured Streaming Actions

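A streaming query is started by DataStreamWriter.start() (or toTable()), which is reached through Dataset.writeStream. The methods that control the running query afterwards are found in sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala:

awaitTermination(), processAllAvailable(), stop()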
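The following sketch walks through the lifecycle of a streaming query end to end. It assumes a local SparkSession; the temp-directory input file, the in-memory sink, and the query name demo_lines are hypothetical choices that keep the example self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[2]").appName("streaming-demo").getOrCreate()

// Hypothetical input: one small text file in a temp directory, read as a stream.
val inDir = Files.createTempDirectory("spark-stream-in")
Files.write(inDir.resolve("part-0.txt"), "a\nb\nc\n".getBytes("UTF-8"))
val lines = spark.readStream.text(inDir.toString)

// start() is what launches the query; the in-memory sink is meant for testing only.
val query = lines.writeStream.format("memory").queryName("demo_lines").start()

// Block until the data available right now has been processed, then inspect the sink.
query.processAllAvailable()
val rowCount = spark.table("demo_lines").count()

// stop() ends the query; processAllAvailable() did not.
val activeBeforeStop = query.isActive
query.stop()
val activeAfterStop = query.isActive
```

In a long-running application, awaitTermination() would typically take the place of processAllAvailable() and stop(), since it blocks the calling thread until the query ends.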

GraphX Actions

GraphX operations rely on underlying RDDs, but specific implementations are in graphx/src/main/scala/org/apache/spark/graphx/impl/VertexRDDImpl.scala and EdgeRDDImpl.scala. These inherit standard RDD actions like collect(), count(), and foreach().

How to Identify Actions in the Source Code

To verify whether a method is an action when reading the Spark source, look for method definitions that:

Return concrete values to the driver (e.g., Long, Array[T], List[T]) or Unit, rather than new RDD/Dataset instances
Invoke runJob() or sc.runJob() internally, which triggers the DAG scheduler (Dataset actions reach the scheduler indirectly, by executing the query's physical plan)
Are described as actions in their Scaladoc comments or in the official programming guide

For example, in RDD.scala, the count() method is defined as:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

The call to sc.runJob confirms this is an action that triggers cluster computation.
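Because SparkContext.runJob is public, the pattern behind count() can be reproduced by hand. The sketch below assumes a local SparkSession and an illustrative four-partition RDD; it is a mirror of the idea, not the code Spark itself runs.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[2]").appName("runjob-demo").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// runJob applies the function to each partition's iterator and returns one result per partition.
val perPartition: Array[Long] = sc.runJob(rdd, (it: Iterator[Int]) => it.size.toLong)

// Summing the per-partition sizes on the driver mirrors what count() does.
val manualCount: Long = perPartition.sum
val builtInCount: Long = rdd.count()
```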

Practical Examples of Spark Actions

RDD Action Examples

// core/src/main/scala/org/apache/spark/rdd/RDD.scala
val rdd = sc.parallelize(1 to 1000)

// Aggregation actions
val total: Long = rdd.count()
val sum: Int = rdd.reduce(_ + _)
val aggregated: Int = rdd.fold(0)(_ + _)

// Retrieval actions
val first: Int = rdd.first()
val sample: Array[Int] = rdd.take(10)
val top: Array[Int] = rdd.top(5)

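// Side-effect actions
rdd.foreach(println)  // runs on the executors; on a cluster the output goes to executor logs, not the driver
rdd.saveAsTextFile("hdfs:///tmp/output")  // fails if the output directory already exists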

Dataset and DataFrame Action Examples

// sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
import spark.implicits._

val ds = Seq((1, "a"), (2, "b")).toDS()

// Collection actions
val rows: Array[(Int, String)] = ds.collect()
val count: Long = ds.count()
val firstRow: (Int, String) = ds.first()

// Display actions
ds.show(5, truncate = false)

// Write actions via DataFrameWriter
// sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala
ds.write.mode("overwrite").saveAsTable("my_table")
ds.write.format("parquet").save("/path/to/output")

Streaming Action Examples

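// streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
val stream = ssc.socketTextStream("localhost", 9999)
stream.print()  // DStream output operation; it runs only after ssc.start()
ssc.start()

// sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala
// df must be a streaming DataFrame, for example one created with spark.readStream
val query = df.writeStream.format("console").start()  // start() launches the query
query.awaitTermination(10000)  // block for up to 10 seconds, or until the query ends
query.stop()  // stop the query if it is still running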

Summary

RDD actions are defined in core/src/main/scala/org/apache/spark/rdd/RDD.scala, including collect, count, reduce, and saveAsTextFile; key-value actions such as saveAsHadoopFile are in PairRDDFunctions.scala.
Dataset and DataFrame actions reside in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala, including show, collect, count, and reduce, with write as the entry point to DataFrameWriter.
Write-specific actions are implemented in DataFrameWriter.scala and include save, saveAsTable, and insertInto.
Streaming actions are found in DStream.scala for legacy streaming and StreamingQuery.scala for Structured Streaming.
To verify an action in source code, look for methods that invoke runJob() or return concrete values to the driver rather than new RDD/Dataset instances.

Frequently Asked Questions

How do I distinguish between a transformation and an action in the Spark source code?

In the Spark source code, actions invoke sc.runJob() or similar scheduler methods and return concrete values (e.g., Long, Array[T]) to the driver program. Transformations return new RDD or Dataset instances without triggering job execution. For example, in RDD.scala, map() returns a new RDD (transformation), while count() calls sc.runJob(this, ...).sum (action).

Are all Spark actions available in Python, Java, and R, or only in Scala?

All major actions are available across the Python (PySpark), Java, Scala, and R (SparkR) APIs, because every language binding runs on the same JVM engine whose core is defined in RDD.scala and Dataset.scala. The set of methods and their parameters differs somewhat between languages, so check the language-specific API documentation for exact signatures. The actions in the Scala source remain the most complete reference.

Why do some actions like `saveAsTextFile` not return data to the driver?

Actions like saveAsTextFile, saveAsObjectFile, and foreach are side-effect actions: they write data to external storage systems or execute arbitrary functions on worker nodes without collecting results back to the driver. These methods still trigger job execution, which is what makes them actions, but they return Unit (the Scala equivalent of void) rather than data. You can identify them in RDD.scala by their Unit return type combined with a call to runJob.
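A side-effect action can be seen in a few lines. This sketch assumes a local SparkSession and a hypothetical temp-directory output path; it writes three strings with saveAsTextFile and then needs a second action to bring them back to the driver.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[2]").appName("save-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical output path; saveAsTextFile requires that it does not exist yet.
val outPath = Files.createTempDirectory("spark-save-demo").resolve("out").toString

// The declared type makes the return value explicit: the action yields Unit.
val result: Unit = sc.parallelize(Seq("x", "y", "z"), 2).saveAsTextFile(outPath)

// The data is in storage, not on the driver, until another action reads it back.
val readBack: Array[String] = sc.textFile(outPath).collect().sorted
```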

Where can I find actions specific to Structured Streaming?

Lifecycle methods for Structured Streaming are located in sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala. Unlike batch actions that return data, they control a running query: awaitTermination() blocks until the query stops or fails, processAllAvailable() blocks until all data currently available at the source has been processed (it does not stop the query and is intended for testing), and stop() terminates the query. The query itself is started by DataStreamWriter.start(), reached through Dataset.writeStream; DataFrameWriter handles batch writes only.
