PySpark — Tune JDBC for Parallel effect

Subham Khandelwal
3 min read · Oct 18, 2022

If you have been following along, the last post shared some performance optimization techniques for reading SQL data sources through JDBC, where we looked at the Predicate Pushdown and Pushdown Query options.

Today, let us tune the JDBC connection further to squeeze out every ounce of performance from SQL data source reads.

JDBC Connection

In case you missed my last post on JDBC Predicate Pushdown, check it out here: https://urlit.me/blog/pyspark-jdbc-predicate-pushdown/

As usual, we will work through an example.

Let's create our SparkSession with the required library to read from a SQLite data source using JDBC.

# Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Tuning JDBC") \
    .config('spark.jars.packages', 'org.xerial:sqlite-jdbc:3.39.3.0') \
    .master("local[*]") \
    .getOrCreate()

spark
SparkSession creation

Next, we define a Python decorator to measure execution time. We will use the "noop" format for performance benchmarking, since it triggers the full read without the overhead of writing results anywhere.

# Lets create a simple Python decorator - {get_time} to measure execution time
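
The decorator and a benchmark read could look like the following minimal sketch; the SQLite file path and table name are placeholders for illustration, not the article's actual values.

import time

# Sketch of a {get_time} timing decorator: it runs the decorated function
# once, immediately at decoration time, and prints the elapsed milliseconds.
def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return f"Execution time: {(end_time - start_time) * 1000} ms"
    print(inner_get_time())

@get_time
def baseline_read():
    # Placeholder SQLite file and table names, used only for illustration
    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:sqlite:/path/to/sales.db") \
        .option("driver", "org.sqlite.JDBC") \
        .option("dbtable", "sales_csv") \
        .load()
    # "noop" materializes the full read without persisting any output,
    # which makes it a convenient sink for benchmarking
    df.write.format("noop").mode("overwrite").save()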

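With the session and timing helper in place, the tuning itself revolves around Spark's standard JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) along with fetchsize. The sketch below shows the general shape of such a parallel read; the URL, table, and column names are assumptions for illustration only, not the article's code.

# Partitioned JDBC read: Spark issues numPartitions concurrent queries,
# splitting the partitionColumn range between lowerBound and upperBound
# across tasks; fetchsize controls rows fetched per database round trip.
df_parallel = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlite:/path/to/sales.db") \
    .option("driver", "org.sqlite.JDBC") \
    .option("dbtable", "sales_csv") \
    .option("partitionColumn", "trx_id") \
    .option("lowerBound", 1) \
    .option("upperBound", 1000000) \
    .option("numPartitions", 8) \
    .option("fetchsize", 5000) \
    .load()

# Verify the read is split into the expected number of partitions
print(df_parallel.rdd.getNumPartitions())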