EaseWithApacheSpark — PySpark Series

Subham Khandelwal
2 min read · Oct 1, 2022

This series follows my learning from Apache Spark (PySpark), with quick tips and workarounds for everyday problems.

Representation Image

Click the links below to follow along:

  1. PySpark — Create Data Frame from Python List or Iterable
  2. PySpark — Parse Spark Schema from datatype string
  3. PySpark — Create Data Frame from API
  4. PySpark — Read/Parse JSON column from another Data Frame
  5. PySpark — Flatten JSON/Struct Data Frame dynamically
  6. PySpark — Merge Data Frames with different Schema
  7. PySpark — Optimize Pivot Data Frames like a PRO
  8. PySpark — User Defined Functions vs Higher Order Functions
  9. PySpark — The Famous Salting Technique
  10. PySpark — Columnar Read Optimization
  11. PySpark — The Magic of AQE Coalesce
  12. PySpark — The Tiny File Problem
  13. PySpark — Read Binary Files like PNG or PDF
  14. PySpark — Read Compressed gzip files
  15. PySpark — JDBC Predicate Pushdown
  16. PySpark — Tune JDBC for Parallel effect
  17. PySpark — The Basics of Structured Streaming
  18. PySpark — Count(1) vs Count(*) vs Count(col_name)
  19. PySpark — Distributed Broadcast Variable
  20. PySpark — The Cluster Configuration
  21. PySpark — Optimize Data Scanning exponentially
  22. PySpark — The Factor of Cores
  23. PySpark — Fix Column Header with Spaces
  24. PySpark — Dynamic Partition Overwrite
  25. PySpark — Upsert or SCD1 with Dynamic Overwrite
  26. PySpark — Implementing Persisting Metastore
  27. PySpark — Setup Delta Lake
  28. PySpark — Delta Lake Column Mapping
  29. PySpark — Delta Lake Integration using Manifest
  30. PySpark — Connect Azure ADLS Gen 2
  31. PySpark — Structured Streaming Read from Sockets
  32. PySpark — Structured Streaming Read from Files
  33. PySpark — Structured Streaming Read from Kafka
  34. PySpark — Connect AWS S3
  35. PySpark — Data Frame Joins on Multiple conditions
  36. PySpark — Worst use of Window Functions
  37. PySpark — The Effects of Multiline
  38. PySpark — Optimize Huge File Read
  39. PySpark — Estimate Partition Count for File Read
  40. PySpark — Optimize Parquet Files

If you love this series and wish to buy me a coffee: Buy Subham a Coffee

Check out the Ease With Data YouTube channel: https://www.youtube.com/@easewithdata

Want to connect with me? https://topmate.io/subham_khandelwal

Check out my personal blog: https://urlit.me/blog

GitHub URL for the IPython notebooks: https://github.com/subhamkharwal/ease-with-apache-spark

Please like and follow for more posts.

Subham Khandelwal

⚒️ Senior Data Engineer with 10+ YOE | 📽️ YouTube channel: https://www.youtube.com/@easewithdata | 📞 TopMate : https://topmate.io/subham_khandelwal