PySpark — Tune JDBC for Parallel Effect

If you have been following along, the last post shared some performance optimization techniques for reading SQL data sources through JDBC. We looked at the Predicate Pushdown and Pushdown Query options.

Today, let us tune the JDBC connection further to squeeze out every ounce of performance from SQL data source reads.

JDBC Connection

In case you missed my last post on JDBC Predicate Pushdown, check it out: https://urlit.me/blog/pyspark-jdbc-predicate-pushdown/

As usual, we will work through this with an example.

Let's create our SparkSession with the required library files to read from a SQLite data source using JDBC.

SparkSession creation
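The session setup appears in the notebook as a screenshot; here is a minimal sketch of what it might look like. The jar file name and path are assumptions, so use the SQLite JDBC driver version you actually downloaded.

```python
from pyspark.sql import SparkSession

# Create a SparkSession and attach the SQLite JDBC driver jar.
# The jar name/path below is an assumption; adjust to your setup.
spark = (
    SparkSession.builder
    .appName("Tuning JDBC")
    .master("local[*]")
    .config("spark.jars", "sqlite-jdbc-3.36.0.3.jar")
    .getOrCreate()
)
```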

Next, we define a Python decorator to measure performance. We will write with the “noop” format, which executes the full read plan but discards the output, making it a convenient sink for benchmarking.
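A rough sketch of such a decorator and benchmark helper (the names measure_time and benchmark are illustrative, not the exact code from the notebook):

```python
import time
from functools import wraps

def measure_time(func):
    """Print the wall-clock time a function takes to run."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.2f} s")
        return result
    return wrapper

@measure_time
def benchmark(df):
    # The "noop" format runs the full plan but writes nothing,
    # so the timing reflects the cost of the read itself.
    df.write.format("noop").mode("overwrite").save()
```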

Define the SQLite JDBC parameters for SQL read
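Along these lines, with the database file path and table name as assumptions:

```python
# Hypothetical SQLite database file and table; adjust to your setup.
jdbc_url = "jdbc:sqlite:/data/sales.db"
jdbc_driver = "org.sqlite.JDBC"
table_name = "sales_csv"
```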

Now let's read the data from the database without any tuning. We are not focusing on Predicate Pushdown in this example.

Keep a note of the timings.

Full read without any JDBC tuning
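A sketch of the untuned baseline read, using the connection parameters and benchmark helper defined above:

```python
# Plain JDBC read: no partitioning options, so Spark opens
# a single connection and reads everything serially.
df_full = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", jdbc_driver)
    .option("dbtable", table_name)
    .load()
)

print("Partitions:", df_full.rdd.getNumPartitions())  # expect 1
benchmark(df_full)
```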

As you can see, the full read was made with a single connection to the database; there were no parallel reads.

Let's optimize this. We need to identify a partition key, along with its lowerBound and upperBound values, so Spark can create parallel partitions. In this case, let's use trx_id.

lower and upper bound for partition key
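One way to fetch the bounds is a small pushdown query that returns only the min and max of the assumed partition key trx_id:

```python
# Pushdown query: only the two boundary values cross the wire.
bounds = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", jdbc_driver)
    .option("dbtable", f"(select min(trx_id) as min_id, max(trx_id) as max_id from {table_name}) as t")
    .load()
    .collect()[0]
)

lower_bound, upper_bound = bounds["min_id"], bounds["max_id"]
print(lower_bound, upper_bound)
```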

Determine the parallel processing capability

Number of cores
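The core count gives a sensible starting value for numPartitions:

```python
# Cores available to this Spark application; a reasonable default
# for numPartitions so every core gets a partition to read.
num_partitions = spark.sparkContext.defaultParallelism
print("Cores available:", num_partitions)
```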

So, let's tune our JDBC connection with the above parameters.

Performance with tuned JDBC
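The tuned read might look like the following sketch, wiring in the partition column, bounds, and partition count from the steps above:

```python
# Tuned JDBC read: Spark splits the trx_id range into num_partitions
# slices and opens one connection per slice, reading in parallel.
df_tuned = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", jdbc_driver)
    .option("dbtable", table_name)
    .option("partitionColumn", "trx_id")
    .option("lowerBound", lower_bound)
    .option("upperBound", upper_bound)
    .option("numPartitions", num_partitions)
    .load()
)

print("Partitions:", df_tuned.rdd.getNumPartitions())
benchmark(df_tuned)
```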

It took almost one-third of the time, roughly a 3X faster read, and our data is now partitioned as well. The exact difference will vary with the amount of data and the distribution of the partitionColumn.

It is always advised to use a non-skewed partitionColumn for even data distribution and further improvement. Also make sure numPartitions is set to a value the JDBC source can actually support, since each partition opens its own connection to the database.

There are many more parameters available; check out the official documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Check out the iPython Notebook on GitHub: https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/16_Tuning_JDBC.ipynb

Check out my personal blog: https://urlit.me/blog/

Check out the PySpark Medium Series: https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30

Wish to Buy me a Coffee: Buy Subham a Coffee
