PySpark Series — Basics to Advanced

A series of Medium articles covering well-known and lesser-known concepts of Apache Spark

Subham Khandelwal
2 min read · Oct 1, 2022

This series follows my learnings from Apache Spark (PySpark), with quick tips and workarounds for everyday problems.


If you are new to Spark or PySpark, or want to learn more, I teach Big Data, Spark, Data Engineering and Data Warehousing on my YouTube channel, Ease With Data. Link to PySpark Series.

Click on the links below to follow along:

  1. PySpark — Create Data Frame from Python List or Iterable
  2. PySpark — Parse Spark Schema from datatype string
  3. PySpark — Create Data Frame from API
  4. PySpark — Read/Parse JSON column from another Data Frame
  5. PySpark — Flatten JSON/Struct Data Frame dynamically
  6. PySpark — Merge Data Frames with different Schema
  7. PySpark — Optimize Pivot Data Frames like a PRO
  8. PySpark — User Defined Functions vs Higher Order Functions
  9. PySpark — The Famous Salting Technique
  10. PySpark — Columnar Read Optimization
  11. PySpark — The Magic of AQE Coalesce
  12. PySpark — The Tiny File Problem
  13. PySpark — Read Binary Files like PNG or PDF
  14. PySpark — Read Compressed gzip files
  15. PySpark — JDBC Predicate Pushdown
  16. PySpark — Tune JDBC for Parallel effect
  17. PySpark — The Basics of Structured Streaming
  18. PySpark — Count(1) vs Count(*) vs Count(col_name)
  19. PySpark — Distributed Broadcast Variable
  20. PySpark — The Cluster Configuration
  21. PySpark — Optimize Data Scanning exponentially
  22. PySpark — The Factor of Cores
  23. PySpark — Fix Column Header with Spaces
  24. PySpark — Dynamic Partition Overwrite
  25. PySpark — Upsert or SCD1 with Dynamic Overwrite
  26. PySpark — Implementing Persisting Metastore
  27. PySpark — Setup Delta Lake
  28. PySpark — Delta Lake Column Mapping
  29. PySpark — Delta Lake Integration using Manifest
  30. PySpark — Connect Azure ADLS Gen 2
  31. PySpark — Structured Streaming Read from Sockets
  32. PySpark — Structured Streaming Read from Files
  33. PySpark — Structured Streaming Read from Kafka
  34. PySpark — Connect AWS S3
  35. PySpark — Data Frame Joins on Multiple conditions
  36. PySpark — Worst use of Window Functions
  37. PySpark — The Effects of Multiline
  38. PySpark — Optimize Huge File Read
  39. PySpark — Estimate Partition Count for File Read
  40. PySpark — Optimize Parquet Files
  41. PySpark — Unit Test Cases using PyTest
  42. PySpark — DAG & Explain Plans
  43. PySpark — Optimize Joins in Spark
  44. PySpark — Dynamic Resource Allocation in Spark
  45. PySpark — Spark Streaming Checkpoint Directory
  46. PySpark — Spark Streaming Error and Exception Handling
  47. PySpark — Data Scanning and Partitioning
  48. PySpark — Run Multiple Jobs in Parallel

If you love this series and wish to buy me a coffee: Buy Subham a Coffee

Check out the Ease With Data YouTube channel: https://www.youtube.com/@easewithdata

Wish to connect with me: https://topmate.io/subham_khandelwal

Check out my personal blog: https://urlit.me/blog

GitHub repository with the IPython notebooks: https://github.com/subhamkharwal/ease-with-apache-spark

Please like and follow for more posts.
