# spark | The Apache Software Foundation | Knowledge Base | Instagit

Apache Spark - A unified analytics engine for large-scale data processing

GitHub Stars: 42.8k

Repository: https://github.com/apache/spark

---

## Articles

### [What Is an Executor in Spark Standalone Mode? Key Differences from Workers and Cores](/apache/spark/what-is-executor-in-spark-standalone-cluster)

Understand the executor in Spark standalone mode. Learn how executors differ from workers and cores, and how they run tasks and cache data.

- Tags: deep-dive
- Published: 2026-02-21

### [How to Use Spark unionByName to Align DataFrames by Column Name](/apache/spark/spark-unionbyname-align-cols-by-name)

Learn how to use Spark's unionByName to merge DataFrames by column name rather than by position, and how to handle missing columns automatically with allowMissingColumns=true.

- Tags: how-to-guide
- Published: 2026-02-20

### [Narrow vs Wide Transformations in Spark: Performance and Execution Impact](/apache/spark/narrow-and-wide-transformation-in-spark-performance-question)

Understand narrow vs wide transformations in Spark. Learn how each type affects performance, execution, and stage boundaries.

- Tags: deep-dive
- Published: 2026-02-20

### [Common Issues When Configuring the Spark Cassandra Connector for Large Datasets](/apache/spark/spark-cassandra-connector-large-dataset-issues)

Resolve large dataset issues with the Spark Cassandra connector. Learn to configure split size, batch size, and consistency levels to overcome partitioning bottlenecks and coordinator overload.

- Tags: best-practices
- Published: 2026-02-19

### [Converting a Spark DataFrame to a Pandas DataFrame: Arrow Optimization and Performance Pitfalls](/apache/spark/pyspark-pandas-performance-differences)

Learn how to convert Spark DataFrames to Pandas with Arrow optimization. Understand the key performance differences, avoid common pitfalls, and make sure your data fits in driver memory.

- Tags: performance
- Published: 2026-02-19

### [Where to Find an Exhaustive List of Actions in Apache Spark: Complete API Reference](/apache/spark/where-find-exhaustive-list-actions-in-spark)

Discover where to find an exhaustive list of actions in Apache Spark. Explore the official RDD.scala and Dataset.scala source files for a complete API reference.

- Tags: api-reference
- Published: 2026-02-18

### [How to Use Spark JDBC to Read and Write Database Data in PySpark](/apache/spark/spark-jdbc-py-db-read-write)

Learn to effectively use Spark JDBC to read and write database data in PySpark. Treat relational tables as DataFrames and manage connections efficiently.

- Tags: how-to-guide
- Published: 2026-02-18

### [How to Use the Spark Show Command to Display DataFrames in Table Format](/apache/spark/spark-show-dataframe-table)

Learn how to use Spark's show() method to display DataFrames as a formatted ASCII table. Control the number of rows, truncation, and layout for readable output.

- Tags: how-to-guide
- Published: 2026-02-16

### [Optimizing Spark Partitions to Eliminate Performance Bottlenecks: A Complete Guide](/apache/spark/optimize-spark-partitions-for-speed)

Eliminate Spark performance bottlenecks by optimizing partitioning. Learn to tune spark.sql.shuffle.partitions and use Adaptive Query Execution for faster job execution.

- Tags: how-to-guide
- Published: 2026-02-16

### [Apache Hive vs Spark: Execution Model Differences Developers Must Know](/apache/spark/apache-hive-vs-spark-execution-models)

Understand the Apache Hive and Spark execution models: Hive uses eager MapReduce, while Spark uses lazy DAG execution for faster data processing.

- Tags: deep-dive
- Published: 2026-02-16

### [Spark Rename Column in PySpark: Efficient Methods for Large Datasets](/apache/spark/most-efficient-spark-rename-column)

Learn the most efficient Spark rename column methods in PySpark for large datasets. Use withColumnRenamed or withColumnsRenamed for metadata-only updates and zero data movement.

- Tags: how-to-guide
- Published: 2026-02-16

### [How to Spark Explode Multiple Columns Efficiently and Avoid Cartesian Products](/apache/spark/spark-explode-multiple-columns)

Learn how to explode multiple columns efficiently in Spark SQL by using arrays_zip to combine arrays and avoid Cartesian products.

- Tags: how-to-guide
- Published: 2026-02-16

### [How to Effectively Implement a Spark Filter Operation for Large Datasets](/apache/spark/how-spark-filter-rows)

Learn how to effectively implement Spark filter operations on large datasets. Optimize row selection with column-based expressions and Catalyst for improved performance in Apache Spark.

- Tags: how-to-guide
- Published: 2026-02-16

### [Spark Checkpoint Versus Persist to Disk: Functional Differences for Fault Tolerance](/apache/spark/spark-checkpoint-vs-persist-disk-fault-tolerance)

Understand the functional differences between Spark checkpointing and persisting to disk for fault tolerance. Learn how each handles lineage and driver restarts efficiently.

- Tags: deep-dive
- Published: 2026-02-16

### [Flink vs Spark: Core Architectural Trade-offs for Big Data Projects](/apache/spark/flink-vs-spark-processing-framework-choice)

Compare the core architectural trade-offs between Flink and Spark for big data projects. Choose Flink for low-latency streaming, or Spark for batch analytics and micro-batch streaming.

- Tags: comparison
- Published: 2026-02-16

### [Repartition in Spark vs Coalesce: Performance Differences for Data Redistribution](/apache/spark/repartition-in-spark-vs-coalesce-performance)

Understand repartition vs coalesce performance in Spark. Learn how repartition triggers a full shuffle for even distribution, while coalesce merges existing partitions to reduce overhead at the cost of potentially uneven data balance.

- Tags: performance
- Published: 2026-02-16

### [What Is an RDD in Spark? Core Characteristics and Architecture Explained](/apache/spark/what-is-rdd-in-spark-question)

Discover the RDD, Spark's core abstraction for big data processing. Learn about its immutable, partitioned nature, lazy evaluation, fault tolerance, and parallel operations.

- Tags: deep-dive
- Published: 2026-02-16

### [Databricks vs Spark: Performance and Cost Optimization Guide for Developers](/apache/spark/databricks-vs-spark-optimize-performance-cost)

Compare Databricks vs Spark for developers optimizing performance and cost. Learn how Databricks offers managed autoscaling and optimized runtime while Spark requires manual tuning.

- Tags: comparison
- Published: 2026-02-16

### [PySpark vs Spark: Key Performance Differences in Large-Scale Data Processing](/apache/spark/pyspark-vs-spark-performance-differences)

Explore PySpark vs Spark performance differences for large-scale data processing. Understand serialization overhead and Catalyst engine impact in Spark SQL.

- Tags: performance
- Published: 2026-02-16

### [How to Use Regex in Spark SQL Queries: RLIKE and REGEXP Operators Explained](/apache/spark/how-to-use-sql-regex-query)

Master Spark SQL regex with RLIKE and REGEXP operators. Learn to efficiently filter data using Java regex patterns in your SQL queries.

- Tags: how-to-guide
- Published: 2026-02-11