PySpark — Tune JDBC for Parallel effect
If you have been following along, the last post covered some performance optimization techniques for reading SQL data sources through JDBC. We looked at the Predicate Pushdown and Pushdown Query options.
Today, let us tune the JDBC connection further to squeeze every ounce of performance out of SQL data source reads.
In case you missed my last post on JDBC Predicate Pushdown, check it out here: https://urlit.me/blog/pyspark-jdbc-predicate-pushdown/
As usual, we will work through this with an example.
Let's create our SparkSession with the required library (the SQLite JDBC driver) to read from a SQLite data source using JDBC.
# Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Tuning JDBC") \
    .config('spark.jars.packages', 'org.xerial:sqlite-jdbc:3.39.3.0') \
    .master("local[*]") \
    .getOrCreate()

spark
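Before we get to the measurements, here is a minimal sketch of the kind of tuned read we will be building towards: a partitioned JDBC read that uses partitionColumn, lowerBound, upperBound and numPartitions, plus fetchsize. The database path, table name and partition column below are hypothetical placeholders for illustration, not values from this post.

# Hypothetical example: parallel JDBC read from a SQLite table
# (the db path, "sales" table and "id" column are assumptions for illustration)
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlite:/tmp/sample.db") \
    .option("driver", "org.sqlite.JDBC") \
    .option("dbtable", "sales") \
    .option("partitionColumn", "id") \
    .option("lowerBound", 1) \
    .option("upperBound", 1000000) \
    .option("numPartitions", 8) \
    .option("fetchsize", 5000) \
    .load()

With partitionColumn, lowerBound, upperBound and numPartitions set, Spark issues multiple range-bounded queries in parallel instead of a single sequential read, while fetchsize controls how many rows are pulled from the database per round trip.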
Next, a small Python decorator to measure performance. We will use the "noop" write format for benchmarking, which materializes the read fully without writing any output.
# Let's create a simple Python decorator - {get_time} to…
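The original snippet is truncated above, so here is a minimal sketch of what a get_time decorator like this typically looks like; treat the exact implementation and the benchmark_read helper as assumptions for illustration.

import time
from functools import wraps

def get_time(func):
    # Wrap a function and print its wall-clock execution time
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Execution time for {func.__name__}: {elapsed:.2f} seconds")
        return result
    return wrapper

# Usage with the "noop" format: triggers a full read of the DataFrame
# without writing any output, which makes it convenient for benchmarking reads
@get_time
def benchmark_read(df):
    df.write.format("noop").mode("overwrite").save()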