Where to Find an Exhaustive List of Actions in Apache Spark: Complete API Reference
All built-in Spark actions are defined as public methods in the core API source files, primarily RDD.scala for RDD operations and Dataset.scala for DataFrame/Dataset operations, with additional actions located in DataFrameWriter.scala, DStream.scala, and streaming query classes.
Apache Spark uses lazy evaluation: transformations only build a logical execution plan, while actions trigger the actual computation and either return results to the driver or write them to external storage. There is no single generated list of actions, so an exhaustive list has to come from the source code of the apache/spark repository, where each high-level API exposes its action methods in specific Scala trait and class definitions.
What Are Spark Actions?
Actions are operations that execute the underlying computation graph and materialize results. Unlike transformations, which return new RDDs or Datasets, actions either return values to the driver program or write data to external storage systems. Triggering an action forces Spark to schedule and execute tasks across the cluster.
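The difference can be observed at runtime. The following is a minimal sketch, assuming a local SparkSession with spark-sql on the classpath; the object name LazyEvaluationSketch and the accumulator name touched are illustrative, not part of Spark. The accumulator counts how many elements the map function has processed, so it stays at zero until an action runs.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  // Returns (elements mapped before the action, result of count(), elements mapped after the action)
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lazy-evaluation-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val touched = sc.longAccumulator("touched")
      // Transformation: only recorded in the lineage, nothing executes yet
      val doubled = sc.parallelize(1 to 100).map { x => touched.add(1); x * 2 }
      val before = touched.sum
      // Action: schedules a job, which runs the map function on every element
      val n = doubled.count()
      val after = touched.sum
      (before, n, after)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, n, after) = run()
    println(s"mapped before action: $before, count: $n, mapped after action: $after")
  }
}
```

Accumulator updates made inside transformations can be applied more than once when tasks are retried, so this is a demonstration of laziness rather than a reliable counting technique.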
Core API Source Files Containing All Spark Actions
The exhaustive list of actions is distributed across the main API modules. Each file below contains the complete public interface for actionable methods in its respective domain.
RDD Actions
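The definitive source for RDD actions is core/src/main/scala/org/apache/spark/rdd/RDD.scala. This file defines most of the built-in RDD actions; a few save methods live in sibling files in the same package and reach RDDs through implicit conversions. The main actions are:
- collect(), count(), first(), take(), takeOrdered(), top()
- reduce(), fold(), aggregate()
- foreach(), foreachPartition()
- saveAsTextFile(), saveAsObjectFile()
- saveAsSequenceFile() (defined in SequenceFileRDDFunctions.scala)
- saveAsHadoopFile(), saveAsNewAPIHadoopFile() (defined in PairRDDFunctions.scala, available on key-value RDDs)
- countByValue(), countApprox(), countApproxDistinct()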
Dataset and DataFrame Actions
For the high-level Dataset API (which includes DataFrames), examine sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala. Key actions include:
- show(), collect(), collectAsList()
- count(), first(), head(), take(), takeAsList()
- reduce() (Dataset has no fold() counterpart to the RDD method)
- foreach(), foreachPartition()
- write (returns a DataFrameWriter, which contains additional write actions)
DataFrameWriter Actions
Write-specific actions for saving data are defined in sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala (and the corresponding interface in sql/api). These include:
- save(), saveAsTable(), insertInto()
- mode(), format(), partitionBy(), bucketBy(), options() (configuration methods that precede the final write action)
DStream Actions (Spark Streaming)
For the legacy DStream API, actions are located in streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala:
- print(), foreachRDD()
- saveAsObjectFiles(), saveAsTextFiles()
- saveAsHadoopFiles(), saveAsNewAPIHadoopFiles() (defined in PairDStreamFunctions.scala, available on key-value DStreams)
Structured Streaming Actions
Control actions for streaming queries are found in sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala:
- awaitTermination(), processAllAvailable(), stop()
GraphX Actions
GraphX operations rely on underlying RDDs, but specific implementations are in graphx/src/main/scala/org/apache/spark/graphx/impl/VertexRDDImpl.scala and EdgeRDDImpl.scala. These inherit standard RDD actions like collect(), count(), and foreach().
How to Identify Actions in the Source Code
To verify whether a method is an action when reading the Spark source, look for method definitions that:
- Return concrete values to the driver (e.g., Long, Array[T], List[T]) rather than new RDD/Dataset instances
- Invoke runJob() or sc.runJob() internally, either directly or through another action, which triggers the DAG scheduler
- Are documented as actions in the source, for example under the "Actions (launch a job to return a value to the user program)" section comment in RDD.scala
For example, in RDD.scala, the count() method is defined as:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
The call to sc.runJob confirms this is an action that triggers cluster computation.
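The two source-level checks above can be roughed out as a small scanner. This is a sketch only: ActionSignatureScan, its regular expression, and the sample text are hypothetical, and the sample is a simplified single-line excerpt in the style of RDD.scala with bodies elided. Real definitions often span several lines, and some methods that return non-RDD values never run a job (persist() and getNumPartitions, for instance), so the output is a list of candidates to inspect by hand.

```scala
object ActionSignatureScan {
  final case class Signature(name: String, returnType: String, callsRunJob: Boolean)

  // Simplified single-line excerpt in the style of RDD.scala (bodies elided), used as sample input
  val sample: String =
    """def map[U: ClassTag](f: T => U): RDD[U] = withScope { ... }
      |def filter(f: T => Boolean): RDD[T] = withScope { ... }
      |def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
      |def first(): T = withScope { ... }
      |def take(num: Int): Array[T] = withScope { ... }""".stripMargin

  // Matches: def name[TypeParams](params): ReturnType =
  private val Def = """def\s+(\w+)(?:\[[^\]]*\])?\([^)]*\)\s*:\s*([^=]+?)\s*=""".r

  // Extracts one Signature per line that contains a method definition
  def scan(source: String): List[Signature] =
    source.split("\n").toList.flatMap { line =>
      Def.findFirstMatchIn(line).map { m =>
        Signature(m.group(1), m.group(2).trim, line.contains("runJob"))
      }
    }

  // Heuristic: a method whose return type is not an RDD is an action candidate
  def actionCandidates(sigs: List[Signature]): List[String] =
    sigs.filterNot(_.returnType.startsWith("RDD[")).map(_.name)

  def main(args: Array[String]): Unit = {
    val sigs = scan(sample)
    println("action candidates: " + actionCandidates(sigs).mkString(", "))
    println("direct runJob calls: " + sigs.filter(_.callsRunJob).map(_.name).mkString(", "))
  }
}
```

On this sample only count() calls runJob on its definition line; first() and take() reach the scheduler through code inside their elided bodies, which is why the return-type check and the runJob check are best used together.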
Practical Examples of Spark Actions
RDD Action Examples
// core/src/main/scala/org/apache/spark/rdd/RDD.scala
val rdd = sc.parallelize(1 to 1000)
// Aggregation actions
val total: Long = rdd.count()
val sum: Int = rdd.reduce(_ + _)
val aggregated: Int = rdd.fold(0)(_ + _)
// Retrieval actions
val first: Int = rdd.first()
val sample: Array[Int] = rdd.take(10)
val top: Array[Int] = rdd.top(5)
// Side-effect actions
rdd.foreach(println)
rdd.saveAsTextFile("hdfs:///tmp/output")
Dataset and DataFrame Action Examples
// sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
import spark.implicits._
val ds = Seq((1, "a"), (2, "b")).toDS()
// Collection actions
val rows: Array[(Int, String)] = ds.collect()
val count: Long = ds.count()
val firstRow: (Int, String) = ds.first()
// Display actions
ds.show(5, truncate = false)
// Write actions via DataFrameWriter
// sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala
ds.write.mode("overwrite").saveAsTable("my_table")
ds.write.format("parquet").save("/path/to/output")
Streaming Action Examples
// streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
val stream = ssc.socketTextStream("localhost", 9999)
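stream.print() // DStream output operation; it runs only after ssc.start()
// sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala
// df is assumed to be a streaming DataFrame, e.g. one created with spark.readStream
val query = df.writeStream.format("console").start()
query.awaitTermination(10000) // blocks for at most 10 seconds; the no-argument form blocks until the query ends
query.stop()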
Summary
- RDD actions are defined in core/src/main/scala/org/apache/spark/rdd/RDD.scala, including collect, count, reduce, saveAsTextFile, and many others.
- Dataset and DataFrame actions reside in sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala, providing show, collect, write, and aggregation methods.
- Write-specific actions are implemented in DataFrameWriter.scala and include save, saveAsTable, and insertInto.
- Streaming actions are found in DStream.scala for legacy streaming and StreamingQuery.scala for Structured Streaming.
- To verify an action in source code, look for methods that invoke runJob() or return concrete values to the driver rather than new RDD/Dataset instances.
Frequently Asked Questions
How do I distinguish between a transformation and an action in the Spark source code?
In the Spark source code, actions invoke sc.runJob() or similar scheduler methods and return concrete values (e.g., Long, Array[T]) to the driver program. Transformations return new RDD or Dataset instances without triggering job execution. For example, in RDD.scala, map() returns a new RDD (transformation), while count() calls sc.runJob(this, ...).sum (action).
Are all Spark actions available in Python, Java, and R, or only in Scala?
All major actions are available across Python (PySpark), Java, Scala, and R (SparkR) APIs, as these language bindings wrap the underlying Scala implementations found in RDD.scala and Dataset.scala. However, some specialized methods or parameter variants may differ slightly between languages. The exhaustive list of actions in the source code serves as the master reference for all language implementations.
Why do some actions like saveAsTextFile not return data to the driver?
Actions like saveAsTextFile, saveAsObjectFile, and foreach are side-effect actions that write data to external storage systems or execute arbitrary functions on worker nodes without collecting results back to the driver. These methods still trigger job execution (making them actions), but they return Unit (void) or minimal status information rather than computed datasets. You can identify these in RDD.scala by their return types and their use of runJob for side effects only.
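A minimal sketch of a side-effect action, assuming a local SparkSession and a writable temporary directory; SideEffectActionSketch and the directory names are illustrative. The write returns Unit, and a second action reads the files back to confirm that the job actually ran.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SideEffectActionSketch {
  // Writes three records with saveAsTextFile and returns how many lines can be read back
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("side-effect-action-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // saveAsTextFile requires that the output directory does not exist yet
      val out = Files.createTempDirectory("spark-actions").resolve("out").toString
      val rdd = sc.parallelize(Seq("a", "b", "c"))
      // The action runs a job that writes part files; no data comes back to the driver
      val result: Unit = rdd.saveAsTextFile(out)
      // A second action reads the written part files to confirm the side effect
      sc.textFile(out).count()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"lines read back: ${run()}")
}
```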
Where can I find actions specific to Structured Streaming?
Structured Streaming actions are located in sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala. Unlike batch actions that return data, these methods control the lifecycle of a running query: awaitTermination() blocks until the query terminates, processAllAvailable() blocks until all data currently available in the source has been processed (the query keeps running afterwards), and stop() terminates the query. The query itself is launched by start() on the DataStreamWriter returned from writeStream, the streaming counterpart of the batch write actions in DataFrameWriter.scala.
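The lifecycle can be sketched end to end with a file source and the in-memory sink. This assumes a local SparkSession; StreamingLifecycleSketch, the lines_sink query name, and the temporary input file are illustrative choices. processAllAvailable() is documented as intended for testing, which suits a bounded example like this one.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object StreamingLifecycleSketch {
  // Returns (rows visible in the memory sink, whether the query is still active after stop())
  def run(): (Long, Boolean) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("streaming-lifecycle-sketch")
      .getOrCreate()
    try {
      // Input directory holding one text file with three lines
      val in = Files.createTempDirectory("stream-in")
      Files.write(in.resolve("part-0.txt"), "a\nb\nc\n".getBytes(StandardCharsets.UTF_8))

      // Defining the streaming read is lazy, like any other transformation
      val lines = spark.readStream.text(in.toString)
      // start() launches the query; the memory sink exposes results as a temporary table
      val query = lines.writeStream.format("memory").queryName("lines_sink").start()

      // Blocks until the data currently available has been processed; the query stays active
      query.processAllAvailable()
      val rows = spark.table("lines_sink").count()

      // stop() ends the query
      query.stop()
      (rows, query.isActive)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (rows, active) = run()
    println(s"rows in sink: $rows, active after stop: $active")
  }
}
```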