
PySpark — The Factor of Cores

4 min read · Oct 28, 2022

Default parallelism is a key factor in Spark execution. It is the number of tasks Spark can run in parallel, and it depends directly on the number of cores available for execution.

For a well-optimized data load, it is important to tune the degree of parallelism, and this is where the number of cores comes into play. Unless spark.default.parallelism is set explicitly, default parallelism in Spark is the total number of cores available for execution.
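To see why the core count matters, here is a rough sketch of how a stage's tasks map onto cores. It assumes one task per partition, one core per task (Spark's default for spark.task.cpus), and tasks of similar duration, all of which are simplifications. The helper name task_waves is hypothetical and not part of PySpark.

```python
import math


def task_waves(num_partitions: int, total_cores: int) -> int:
    """Estimate how many scheduling waves a stage needs.

    Assumes one task per partition, one core per task, and
    tasks of similar duration (all simplifications).
    """
    if num_partitions < 0:
        raise ValueError("num_partitions must be non-negative")
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    # Each wave runs at most `total_cores` tasks at the same time.
    return math.ceil(num_partitions / total_cores)


# With 2 cores, 8 partitions are processed 2 tasks at a time.
waves = task_waves(num_partitions=8, total_cores=2)
print(waves)  # 4
```

Under the same assumptions, 5 partitions on 2 cores need 3 waves, and the last wave leaves one core idle. This is one reason partition counts that are a multiple of the core count tend to keep the cores more evenly busy.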

Let's run through a quick example to understand this. We define our environment with two cores to keep the problem simple to understand and visualize.

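# Create Spark Session
from pyspark.sql import SparkSession

# local[2] runs Spark locally with 2 worker threads (cores)
spark = SparkSession \
    .builder \
    .appName("Factor of cores") \
    .master("local[2]") \
    .getOrCreate()

# Display the session summary (in a notebook)
spark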
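In local mode, the number in brackets in the master URL is the number of worker threads Spark starts. Unless spark.default.parallelism is set explicitly, that is also the value defaultParallelism reports. As a minimal sketch, the hypothetical helper local_master_cores below (not a PySpark API) mirrors that mapping for the common forms "local", "local[N]" and "local[*]" only.

```python
import os
import re


def local_master_cores(master: str) -> int:
    """Return the worker-thread count implied by a local master URL.

    Handles "local", "local[N]" and "local[*]" only; other forms
    (for example "local[N,F]" or cluster URLs) are out of scope here.
    """
    if master == "local":
        return 1  # a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"unsupported master URL: {master!r}")
    value = match.group(1)
    if value == "*":
        # "*" means as many threads as logical cores on this machine
        return os.cpu_count() or 1
    threads = int(value)
    if threads < 1:
        raise ValueError("thread count must be at least 1")
    return threads


print(local_master_cores("local[2]"))  # 2
```

With "local[*]" the result depends on the machine, so the same code can report a different parallelism on a laptop and on a larger server.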

Check the parallelism now available (with local[2], this should be 2):

# Determine the degree of parallelism
spark.sparkContext.defaultParallelism

Disable all AQE (Adaptive Query Execution) features to get a baseline:

# Disable all AQE optimization for benchmarking tests
spark.conf.set("spark.sql.adaptive.enabled", False)…


Written by Subham Khandelwal
