# Where to Find an Exhaustive List of Actions in Apache Spark: Complete API Reference

> Discover where to find an exhaustive list of actions in Apache Spark. Explore the official `RDD.scala` and `Dataset.scala` source files for a complete API reference to optimize your Spark applications.

- Repository: [The Apache Software Foundation/spark](https://github.com/apache/spark)
- Tags: api-reference
- Published: 2026-02-18

---

**All built-in Spark actions are defined as public methods in the core API source files, primarily [`RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala) for RDD operations and [`Dataset.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala) for DataFrame/Dataset operations, with additional actions located in [`DataFrameWriter.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala), [`DStream.scala`](https://github.com/apache/spark/blob/main/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala), and streaming query classes.**

Apache Spark uses lazy evaluation: transformations build a logical execution plan, while **actions** trigger the actual computation and either return results to the driver or write them to external storage. For an exhaustive list of actions in Spark, examine the source code of the `apache/spark` repository, where each high-level API exposes its action methods in specific Scala class definitions.

## What Are Spark Actions?

Actions are operations that execute the underlying computation graph and materialize results. Unlike transformations, which return new RDDs or Datasets, actions either return values to the driver program or write data to external storage systems. Triggering an action forces Spark to schedule and execute tasks across the cluster.
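The difference between building a plan and executing it can be observed directly. The following is a minimal sketch, assuming Spark is on the classpath and the code runs as a script or in `spark-shell`; the `local[2]` master, the app name, and the accumulator name `touched` are illustrative choices. It uses a `LongAccumulator` to record how often the function passed to `map()` actually runs.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[2]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

// Counts how many times the map function is actually invoked.
val touched = sc.longAccumulator("touched")

// Transformation: only records lineage, so the function has not run yet.
val doubled = sc.parallelize(1 to 100).map { x => touched.add(1); x * 2 }
val before: Long = touched.value

// Action: schedules a job, which evaluates the map function for each element.
val n: Long = doubled.count()
val after: Long = touched.value

println(s"before=$before, n=$n, after=$after")
```

`before` stays at zero because `map()` alone schedules nothing; the accumulator only moves once `count()` runs. Accumulator updates made inside transformations can be applied more than once if a task is retried, so a counter like this is better suited to diagnostics than to exact bookkeeping.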

## Core API Source Files Containing All Spark Actions

The exhaustive list of actions is distributed across the main API modules. Each file below contains the complete public interface for actionable methods in its respective domain.

### RDD Actions

The definitive source for RDD actions is [`core/src/main/scala/org/apache/spark/rdd/RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala). This file defines the core RDD actions, including:

- `collect()`, `count()`, `first()`, `take()`, `takeOrdered()`, `top()`
- `reduce()`, `fold()`, `aggregate()`
- `foreach()`, `foreachPartition()`
- `saveAsTextFile()`, `saveAsObjectFile()`
- `countByValue()`, `countApprox()`, `countApproxDistinct()`

Further actions reach RDDs of particular element types through implicit conversions and are defined in sibling files in the same package: `saveAsHadoopFile()` and `saveAsNewAPIHadoopFile()` in `PairRDDFunctions.scala`, and `saveAsSequenceFile()` in `SequenceFileRDDFunctions.scala`.
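Some of the less familiar actions are easier to understand from a small run. The sketch below assumes a local Spark setup (master and app name are illustrative) and exercises `aggregate()`, `countByValue()`, and `takeOrdered()`, plus `countByKey()`, which becomes available on key-value RDDs through the implicit conversion to `PairRDDFunctions`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("rdd-actions").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(Seq(3, 1, 2, 3, 3))

// aggregate: compute (sum, count) in a single pass over the data.
val (sum, cnt) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // fold one element into a partition's accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)    // merge the accumulators of two partitions
)

// countByValue: frequency of each distinct element, returned as a driver-side map.
val freq = nums.countByValue()

// takeOrdered: the k smallest elements in ascending order.
val smallest = nums.takeOrdered(2)

// Pair-RDD action, provided by PairRDDFunctions via implicit conversion.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val perKey = pairs.countByKey()

println(s"sum=$sum cnt=$cnt freq=$freq smallest=${smallest.mkString(",")} perKey=$perKey")
```

Each of these returns an ordinary Scala value on the driver, so results such as `freq` and `perKey` should only be requested when the number of distinct values or keys is small enough to fit in driver memory.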

### Dataset and DataFrame Actions

For the high-level Dataset API (which includes DataFrames), examine [`sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala). Key actions include:

- `show()`, `collect()`, `collectAsList()`
- `count()`, `first()`, `head()`, `take()`, `takeAsList()`
- `reduce()`, `toLocalIterator()`, `isEmpty` (unlike `RDD`, `Dataset` does not define `fold()`)
- `foreach()`, `foreachPartition()`
- `write` (returns a `DataFrameWriter`, which contains additional write actions)

### DataFrameWriter Actions

Write-specific actions for saving data are defined in [`sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala) (and the corresponding interface in `sql/api`). These include:

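- `save()`, `saveAsTable()`, `insertInto()`
- `parquet()`, `json()`, `csv()`, `orc()`, `text()`, `jdbc()` (format-specific shortcuts that also perform the write)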
- `mode()`, `format()`, `partitionBy()`, `bucketBy()`, `options()` (configuration methods that precede the final write action)
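The split between configuration methods and the final write action can be checked with a small experiment. This is a sketch under assumptions: a local Spark setup, and a hypothetical output location inside a fresh temporary directory. It verifies that nothing exists at the target path until `save()` is called.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("writer-demo").getOrCreate()

// Hypothetical output location: a child path that does not exist yet.
val outPath = Files.createTempDirectory("spark-writer-demo").resolve("numbers")
val out = outPath.toUri.toString

val df = spark.range(10).toDF("n")

// Configuration only: these calls return the DataFrameWriter and write nothing.
val writer = df.write.format("parquet").mode("overwrite")
val existsBeforeSave = Files.exists(outPath)

// Action: save() runs the job and materializes the files.
writer.save(out)
val existsAfterSave = Files.exists(outPath)

// Reading the data back (count() is itself an action) confirms what was written.
val written: Long = spark.read.parquet(out).count()

println(s"before=$existsBeforeSave after=$existsAfterSave rows=$written")
```

Passing the path as a `file:` URI keeps the example on the local file system even if the Hadoop configuration in use points at a different default file system.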

### DStream Actions (Spark Streaming)

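For the legacy DStream API, the counterparts of actions are called output operations. Calling one registers the output; execution begins only after `StreamingContext.start()`. They are located in [`streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala`](https://github.com/apache/spark/blob/main/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala):

- `print()`, `foreachRDD()`
- `saveAsObjectFiles()`, `saveAsTextFiles()`

`saveAsHadoopFiles()` and `saveAsNewAPIHadoopFiles()` apply to key-value DStreams and are defined in `PairDStreamFunctions.scala` in the same package.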

### Structured Streaming Actions

Control actions for streaming queries are found in [`sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala`](https://github.com/apache/spark/blob/main/sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala):

- `awaitTermination()`, `processAllAvailable()`, `stop()`

### GraphX Actions

GraphX operations rely on underlying RDDs, but specific implementations are in [`graphx/src/main/scala/org/apache/spark/graphx/impl/VertexRDDImpl.scala`](https://github.com/apache/spark/blob/main/graphx/src/main/scala/org/apache/spark/graphx/impl/VertexRDDImpl.scala) and [`EdgeRDDImpl.scala`](https://github.com/apache/spark/blob/main/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeRDDImpl.scala). These inherit standard RDD actions like `collect()`, `count()`, and `foreach()`.

## How to Identify Actions in the Source Code

To verify whether a method is an action when reading the Spark source, look for method definitions that:

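1. Return concrete values to the driver (e.g., `Long`, `Array[T]`, `java.util.List[T]`) or `Unit`, rather than new RDD/Dataset instances
2. Invoke `runJob()` or `sc.runJob()`, directly or through a helper they call, which triggers the DAG scheduler
3. Are grouped as actions in the source: `RDD.scala` places them under a comment heading that begins `// Actions`, and `Dataset.scala` tags them with `@group action` in the Scaladoc (the `@DeveloperApi` annotation marks API stability and says nothing about whether a method is an action)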

For example, in [`RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala), the `count()` method is defined as:

```scala
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

```

The call to `sc.runJob` confirms this is an action that triggers cluster computation.
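The same distinction can be observed at runtime instead of by reading source. The following sketch assumes a local Spark setup and registers a `SparkListener` that counts job starts; `jobsTriggeredBy` is a hypothetical helper written for this illustration, not part of Spark. Listener events are delivered asynchronously, so the helper uses a short sleep as a crude wait rather than a precise synchronization mechanism.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("job-probe").getOrCreate()
val sc = spark.sparkContext

// Incremented on the listener-bus thread each time the scheduler starts a job.
val jobsStarted = new AtomicInteger(0)
sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    jobsStarted.incrementAndGet()
    ()
  }
})

// Hypothetical helper: evaluates `body` and reports how many jobs started meanwhile.
def jobsTriggeredBy(body: => Any): Int = {
  val start = jobsStarted.get()
  body
  Thread.sleep(1000) // crude wait for asynchronously delivered listener events
  jobsStarted.get() - start
}

val rdd = sc.parallelize(1 to 1000)
val fromMap = jobsTriggeredBy(rdd.map(_ + 1))   // transformation: no job expected
val fromCount = jobsTriggeredBy(rdd.count())    // action: a job is expected

println(s"map started $fromMap job(s), count started $fromCount job(s)")
```

A non-zero job count is strong evidence rather than a definition: a few transformations, `sortByKey()` for instance, start a job of their own to sample data for range partitioning.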

## Practical Examples of Spark Actions

### RDD Action Examples

```scala
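// core/src/main/scala/org/apache/spark/rdd/RDD.scala
// `sc` is the SparkContext that spark-shell provides
val rdd = sc.parallelize(1 to 1000)

// Aggregation actions
val total: Long = rdd.count()
val sum: Int = rdd.reduce(_ + _)
val aggregated: Int = rdd.fold(0)(_ + _)

// Retrieval actions
val first: Int = rdd.first()
val sample: Array[Int] = rdd.take(10)  // the first 10 elements, not a random sample
val top: Array[Int] = rdd.top(5)       // the 5 largest elements, in descending order

// Side-effect actions
rdd.foreach(println)                      // runs on the executors; output goes to their stdout
rdd.saveAsTextFile("hdfs:///tmp/output")  // fails if the target directory already exists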

```

### Dataset and DataFrame Action Examples

```scala
// sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala
// `spark` is the SparkSession that spark-shell provides
import spark.implicits._

val ds = Seq((1, "a"), (2, "b")).toDS()

// Collection actions
val rows: Array[(Int, String)] = ds.collect()
val count: Long = ds.count()
val firstRow: (Int, String) = ds.first()

// Display actions
ds.show(5, truncate = false)

// Write actions via DataFrameWriter
// sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala
ds.write.mode("overwrite").saveAsTable("my_table")
ds.write.format("parquet").save("/path/to/output")

```

### Streaming Action Examples

```scala
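// streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
// Assumes an existing StreamingContext named `ssc`
val stream = ssc.socketTextStream("localhost", 9999)
stream.print()  // DStream output operation: registered here, executed once the context starts
ssc.start()     // begins receiving and processing data

// sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala
// Assumes `df` is a streaming DataFrame created with spark.readStream
val query = df.writeStream.format("console").start()  // start() launches the query
query.awaitTermination(10000)  // block for at most 10 seconds; the no-arg variant blocks until termination
query.stop()                   // terminate the query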

```

## Summary

- **RDD actions** are defined in [`core/src/main/scala/org/apache/spark/rdd/RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala), including `collect`, `count`, `reduce`, and `saveAsTextFile`; the Hadoop and SequenceFile save actions live in `PairRDDFunctions.scala` and `SequenceFileRDDFunctions.scala`.
- **Dataset and DataFrame actions** reside in [`sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala), providing `show`, `collect`, `count`, `reduce`, and `write`.
- **Write-specific actions** are implemented in [`DataFrameWriter.scala`](https://github.com/apache/spark/blob/main/sql/core/src/main/scala/org/apache/spark/sql/classic/DataFrameWriter.scala) and include `save`, `saveAsTable`, and `insertInto`.
- **Streaming actions** are found in [`DStream.scala`](https://github.com/apache/spark/blob/main/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala) for legacy streaming and [`StreamingQuery.scala`](https://github.com/apache/spark/blob/main/sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala) for Structured Streaming.
- To verify an action in source code, look for methods that invoke `runJob()` or return concrete values to the driver rather than new RDD/Dataset instances.

## Frequently Asked Questions

### How do I distinguish between a transformation and an action in the Spark source code?

In the Spark source code, **actions** invoke `sc.runJob()` or similar scheduler methods and return concrete values (e.g., `Long`, `Array[T]`) or `Unit` to the driver program. **Transformations** return new RDD or Dataset instances without triggering job execution. For example, in [`RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala), `map()` returns a new `RDD` (transformation), while `count()` calls `sc.runJob(this, ...).sum` (action).

### Are all Spark actions available in Python, Java, and R, or only in Scala?

All major actions are available across the **Python (PySpark)**, **Java**, **Scala**, and **R (SparkR)** APIs, as these language bindings largely wrap the underlying Scala implementations found in [`RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala) and [`Dataset.scala`](https://github.com/apache/spark/blob/main/sql/api/src/main/scala/org/apache/spark/sql/Dataset.scala). However, some specialized methods or parameter variants may differ slightly between languages. The list of actions in the Scala source serves as the master reference for the other language implementations.

### Why do some actions like `saveAsTextFile` not return data to the driver?

Actions like `saveAsTextFile`, `saveAsObjectFile`, and `foreach` are **side-effect actions** that write data to external storage systems or execute arbitrary functions on worker nodes without collecting results back to the driver. These methods still trigger job execution (making them actions), but they return `Unit` rather than computed data. You can identify these in [`RDD.scala`](https://github.com/apache/spark/blob/main/core/src/main/scala/org/apache/spark/rdd/RDD.scala) by their `Unit` return type and by the job they launch purely for its side effects.
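A short round trip makes the point concrete. This sketch assumes a local Spark setup and a hypothetical output location inside a fresh temporary directory; it shows that `saveAsTextFile()` yields `Unit` and that its effect is visible only in the files it leaves behind.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("side-effects").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "action", "unit"), numSlices = 2)

// saveAsTextFile requires that the target directory does not exist yet,
// so a child path inside a fresh temporary directory is used (hypothetical location).
val out = Files.createTempDirectory("spark-side-effects").resolve("words").toUri.toString

// The action returns Unit; its only observable result is the files it writes.
val result: Unit = words.saveAsTextFile(out)

// Reading the output back shows that the job really ran.
val roundTrip = sc.textFile(out).collect().sorted

println(roundTrip.mkString(", "))
```

The lines are sorted after reading because the order in which part files are read back is not something the example should depend on.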

### Where can I find actions specific to Structured Streaming?

Structured Streaming actions are located in [`sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala`](https://github.com/apache/spark/blob/main/sql/streaming/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala). Unlike batch actions that return data, streaming actions control the query execution lifecycle: `awaitTermination()` blocks until the query terminates, `processAllAvailable()` blocks until all data currently available in the source has been processed (the query keeps running afterwards), and `stop()` terminates the query. The query itself is launched by `start()` on the `DataStreamWriter` returned by `writeStream`; `DataFrameWriter` handles batch writes only.
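The lifecycle of a streaming query can be walked through end to end with Spark's built-in `rate` source and `memory` sink. This is a minimal sketch assuming a local Spark setup; the query name `rate_probe`, the row rate, and the three-second timeout are arbitrary choices for the illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("streaming-lifecycle").getOrCreate()

// The built-in "rate" source generates rows continuously, which suits a demo.
val ticks = spark.readStream.format("rate").option("rowsPerSecond", 5L).load()

// start() on the DataStreamWriter launches the query; the "memory" sink needs a query name.
val query = ticks.writeStream.format("memory").queryName("rate_probe").start()

// Bounded wait: returns false if the query is still running when the timeout expires.
val terminated: Boolean = query.awaitTermination(3000L)
val activeBeforeStop: Boolean = query.isActive

// stop() ends the query; afterwards isActive reports false.
query.stop()
val activeAfterStop: Boolean = query.isActive

println(s"terminated=$terminated activeBeforeStop=$activeBeforeStop activeAfterStop=$activeAfterStop")
```

The timed variant of `awaitTermination` is used here because the no-argument form would block indefinitely on a source that never runs out of data.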