PySpark — The Cluster Configuration
Spark applications rely heavily on the cluster configuration used for execution. It is important to know the cluster sizing beforehand so that the application gets enough resources to process data efficiently.
Spark derives its power from distributed, parallel processing. To get the best results, we always have to estimate and tune the right configuration from the available options.
In Spark, the degree of parallelism = number of executors × number of cores per executor.
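To make the formula concrete, here is a minimal sketch in plain Python. The numbers are purely illustrative, not a recommendation:

```python
# Degree of parallelism = number of executors x cores per executor.
# Illustrative values only; the right numbers depend on your cluster.
num_executors = 10        # total executors across the cluster
cores_per_executor = 5    # concurrent task slots per executor

degree_of_parallelism = num_executors * cores_per_executor
print(f"Tasks that can run simultaneously: {degree_of_parallelism}")  # 50
```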
Let's consider the following example: a cluster of 10 nodes, with 26 cores and 256 GB of memory per node.
Fat Executors: we assign all the cores on a node to a single executor, i.e. 1 executor per node with 26 cores (and all 256 GB of memory). Downside: such a huge JVM heap causes a lot of Garbage Collection (GC) pauses, leading to slow performance.
Tiny/Slim Executors: we assign 1 core per executor, i.e. 26 executors per node in the configuration above. Downside: with so many single-core executors there is far more data movement between executors, and each executor JVM runs only one task at a time, so the application stays busy shuffling data and performance degrades.
So, what is a better configuration? Several tests suggest that 4 to 5 cores per executor gives the optimum performance and avoids the problems of both fat and tiny executors.
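Putting that recommendation together with the example cluster, here is a minimal sizing sketch in plain Python. It relies on assumptions that are not stated above but are common practice: reserve 1 core and about 1 GB of memory per node for the OS and Hadoop daemons, subtract roughly 10% of executor memory for YARN overhead, and keep one executor slot free for the ApplicationMaster/driver. Adjust these for your own cluster manager.

```python
# Sizing sketch for the example cluster: 10 nodes, 26 cores/node, 256 GB RAM/node.
# Assumptions (not from the text above): 1 core and ~1 GB per node reserved for
# OS/daemons, ~10% executor-memory overhead for YARN, 1 executor kept aside
# for the ApplicationMaster.

nodes = 10
cores_per_node = 26
memory_per_node_gb = 256

cores_per_executor = 5                                  # the 4-5 core sweet spot
usable_cores_per_node = cores_per_node - 1              # leave 1 core for OS/daemons
executors_per_node = usable_cores_per_node // cores_per_executor    # 5
total_executors = nodes * executors_per_node - 1        # reserve 1 for the AM -> 49

memory_per_executor_gb = (memory_per_node_gb - 1) // executors_per_node   # ~51 GB
executor_memory_gb = int(memory_per_executor_gb * 0.90)  # ~10% overhead -> ~45 GB

print(f"executors per node    : {executors_per_node}")
print(f"total executors       : {total_executors}")
print(f"memory per executor   : {executor_memory_gb} GB")
print(f"degree of parallelism : {total_executors * cores_per_executor}")

print(
    "spark-submit --num-executors {n} --executor-cores {c} --executor-memory {m}G ..."
    .format(n=total_executors, c=cores_per_executor, m=executor_memory_gb)
)
```

The same values can also be supplied as the spark.executor.instances, spark.executor.cores and spark.executor.memory properties when building a SparkSession.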