PySpark Series — Basics to Advanced

Series of Medium articles detailing some known and unknown concepts of Apache Spark

2 min read · Oct 1, 2022

This series shares what I have learned working with Apache Spark (PySpark), with quick tips and workarounds for everyday problems.

If you are new to Spark or PySpark, or want to learn more: I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube channel, Ease With Data. Link to PySpark Series.

Click on the links below to follow along:

PySpark — Create Data Frame from Python List or Iterable
PySpark — Parse Spark Schema from datatype string
PySpark — Create Data Frame from API
PySpark — Read/Parse JSON column from another Data Frame
PySpark — Flatten JSON/Struct Data Frame dynamically
PySpark — Merge Data Frames with different Schema
PySpark — Optimize Pivot Data Frames like a PRO
PySpark — User Defined Functions vs Higher Order Functions
PySpark — The Famous Salting Technique
PySpark — Columnar Read Optimization
PySpark — The Magic of AQE Coalesce
PySpark — The Tiny File Problem
PySpark — Read Binary Files like PNG or PDF
PySpark — Read Compressed gzip files
PySpark — JDBC Predicate Pushdown
PySpark — Tune JDBC for Parallel effect
PySpark — The Basics of Structured Streaming
PySpark — Count(1) vs Count(*) vs Count(col_name)
PySpark — Distributed Broadcast Variable
PySpark — The Cluster Configuration
PySpark — Optimize Data Scanning exponentially
PySpark — The Factor of Cores
PySpark — Fix Column Header with Spaces
PySpark — Dynamic Partition Overwrite
PySpark — Upsert or SCD1 with Dynamic Overwrite
PySpark — Implementing Persisting Metastore
PySpark — Setup Delta Lake
PySpark — Delta Lake Column Mapping
PySpark — Delta Lake Integration using Manifest
PySpark — Connect Azure ADLS Gen 2
PySpark — Structured Streaming Read from Sockets
PySpark — Structured Streaming Read from Files
PySpark — Structured Streaming Read from Kafka
PySpark — Connect AWS S3
PySpark — Data Frame Joins on Multiple conditions
PySpark — Worst use of Window Functions
PySpark — The Effects of Multiline
PySpark — Optimize Huge File Read
PySpark — Estimate Partition Count for File Read
PySpark — Optimize Parquet Files
PySpark — Unit Test Cases using PyTest
PySpark — DAG & Explain Plans
PySpark — Optimize Joins in Spark
PySpark — Dynamic Resource Allocation in Spark
PySpark — Spark Streaming Checkpoint Directory
PySpark — Spark Streaming Error and Exception Handling
PySpark — Data Scanning and Partitioning
PySpark — Run Multiple Jobs in Parallel
PySpark — What is Spark Connect?

If you love this series and wish to buy me a coffee: Buy Subham a Coffee

Check out the Ease With Data YouTube channel: https://www.youtube.com/@easewithdata

To connect with me: https://topmate.io/subham_khandelwal

Check out my personal blog: https://urlit.me/blog

GitHub repository for the IPython notebooks: https://github.com/subhamkharwal/ease-with-apache-spark

Please like and follow for more posts.

Written by Subham Khandelwal
