PySpark — The Factor of Cores

Subham Khandelwal
4 min read · Oct 28, 2022

Default Parallelism is a standout factor in Spark executions. Basically, it is the number of tasks Spark can run in parallel, and it depends directly on the number of cores available for execution.


For a well-optimized data load, it is very important to tune the degree of parallelism, and this is where the factor of cores comes into play. Default parallelism in Spark is defined as the total number of cores available for execution.

Let's run through a quick example to understand this. We define our environment with two cores to keep the problem easy to understand and visualize.

# Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Factor of cores") \
    .master("local[2]") \
    .getOrCreate()

spark

Now check the available parallelism.

# Determine the degree of parallelism
spark.sparkContext.defaultParallelism
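
With the session running on local[2], this returns 2 — one task slot per allotted core. The same number also drives the default partition count when Spark materializes data; a minimal check, assuming the same two-core session created above:

# Default parallelism also decides the default number of partitions
rdd = spark.sparkContext.parallelize(range(10))
rdd.getNumPartitions()  # 2 — one partition per available core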

Disable all AQE features to establish a baseline.

# Disable all AQE optimization for benchmarking tests
spark.conf.set("spark.sql.adaptive.enabled", False)…
