
Subham Khandelwal

601 Followers


Pinned

EaseWithApacheSpark — PySpark Series

This series follows learning Apache Spark (PySpark), with quick tips and workarounds for everyday problems. If you are new to Spark or PySpark, or want to learn more — I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube channel, Ease With Data. Click on…

Spark

2 min read



Nov 19

PySpark — DAG & Explain Plans

Understand how Spark divides jobs into stages and tasks. — Knowing what happens when we submit a job or application to Spark for execution is really important. We usually ignore it because we don’t realize how crucial it is. DAGs and explain plans help us figure out exactly what Spark does under the hood. Today we…
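A minimal sketch of pulling up a plan with DataFrame.explain(); the dataset and column names below are illustrative, and the wide transformation (a groupBy) is what forces the shuffle boundary that splits a job into stages.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("explain-demo").getOrCreate()

# Illustrative DataFrame; the groupBy below is a wide transformation,
# so it introduces a shuffle and therefore a stage boundary.
orders = spark.range(1_000_000).withColumn("amount", F.rand() * 100)
agg = orders.groupBy((F.col("id") % 10).alias("bucket")).agg(F.sum("amount").alias("total"))

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
agg.explain(mode="extended")

# An action creates the actual job; its stages and tasks show up in the Spark UI.
agg.collect()
```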

Pyspark

4 min read



Published in Dev Genius · Sep 30

PySpark — Unit Test Cases using PyTest

Understand how to write unit test cases for PySpark using the PyTest module. — We all write code to develop data applications. To make sure that code works as expected at the most basic level, we need to write unit test cases. This article will help you get started with unit test cases in PySpark. …
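A minimal sketch of a PyTest setup for PySpark, assuming a session-scoped fixture that provides the SparkSession; the transformation under test (add_discount) and its columns are hypothetical.

```python
# test_transformations.py -- a minimal pytest sketch; names are illustrative
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared across the whole test session
    spark = SparkSession.builder.master("local[2]").appName("pytest-pyspark").getOrCreate()
    yield spark
    spark.stop()

def add_discount(df):
    # Hypothetical transformation under test
    return df.withColumn("discounted", F.col("amount") * 0.9)

def test_add_discount(spark):
    df = spark.createDataFrame([(1, 100.0), (2, 200.0)], ["id", "amount"])
    result = add_discount(df).collect()
    assert result[0]["discounted"] == pytest.approx(90.0)
    assert result[1]["discounted"] == pytest.approx(180.0)
```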

Pyspark

5 min read



Apr 16

PySpark — Optimize Parquet Files

Understand how Parquet files can be compacted efficiently using RLE (Run Length Encoding). — Parquet is one of the most popular choices for data storage in the Data & Analytics world, for several reasons: the write-once-read-many paradigm, columnar storage, schema preservation, optimization with encodings, etc. Today we will understand how efficiently we can utilize the default encodings…
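A minimal sketch of one way to help RLE/dictionary encoding compact the data, assuming that clustering repeated values together before the write is the lever the article refers to; the paths and columns are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-rle").getOrCreate()

# Illustrative dataset path and columns
orders = spark.read.json("/data/orders.json")

# Repartitioning and sorting on low-cardinality columns groups repeated values
# together, so Parquet's dictionary/RLE encodings can compact them far better.
(orders
    .repartition("order_status")
    .sortWithinPartitions("order_status", "order_date")
    .write
    .mode("overwrite")
    .parquet("/data/orders_compacted"))
```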

Pyspark

5 min read



Published in Dev Genius · Mar 21

PySpark — Estimate Partition Count for File Read

Understand how Spark estimates the number of partitions required to read a file. — Spark reads files in partitions, and each partition is processed to reach the desired result. But more often than not, we are confused about how the number of partitions is estimated and divided. Today we will deep dive into the logic that Spark uses for this calculation. But before we can…
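As a rough sketch of the calculation (based on the file-partitioning logic in Spark 3.x, so treat the exact rounding as an approximation), the split size is min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)):

```python
import math

# A rough sketch of the split-size calculation Spark uses when reading files;
# the values below are the defaults for the two configs involved.
def estimate_partitions(total_file_bytes, num_files, default_parallelism,
                        max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                        open_cost_in_bytes=4 * 1024 * 1024):    # spark.sql.files.openCostInBytes
    # Each file "costs" an extra openCostInBytes, then the work is spread over cores
    total_bytes = total_file_bytes + num_files * open_cost_in_bytes
    bytes_per_core = total_bytes / default_parallelism
    max_split_bytes = min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))
    # Approximate partition count: total work divided by the split size
    return math.ceil(total_bytes / max_split_bytes)

# Example: a single 1 GiB file with a default parallelism of 8
print(estimate_partitions(1 * 1024**3, num_files=1, default_parallelism=8))  # ~9 partitions of ~128 MB
```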

Pyspark

5 min read



Mar 18

PySpark — Optimize Huge File Read

We have all been in a scenario where we have to deal with huge files on limited compute or resources. More often than not, we need to find ways to optimize such file reads/processing to make our data pipelines efficient. Today we are going to discuss one such Spark configuration…
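The teaser does not name the setting, but one configuration commonly tuned for this is spark.sql.files.maxPartitionBytes; the sketch below assumes that is the knob being discussed, and the path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("huge-file-read").getOrCreate()

# Assumption: the configuration in question is spark.sql.files.maxPartitionBytes.
# Lowering it splits a huge file into more, smaller partitions so the work spreads
# across the available executor cores instead of overloading a few tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB splits

df = spark.read.parquet("/data/huge_dataset")  # illustrative path
print(df.rdd.getNumPartitions())
```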

Pyspark

5 min read



Published in Dev Genius · Mar 13

PySpark — The Effects of Multiline

Optimizations are all around us; a few of them are hidden in small parameter changes and a few in the way we deal with data. Today, we will understand the effects of multiline files with Apache Spark. To explore the effect, we will read the same Orders JSON file with approximately 10,50,000…
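A minimal sketch of the comparison, assuming the usual effect: with multiLine enabled the JSON file is parsed as a single document and becomes non-splittable, so it lands in far fewer partitions; the paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-demo").getOrCreate()

# Default: each line is an independent JSON record, so the file can be split
# across many partitions and read in parallel.
single_line = spark.read.json("/data/orders.json")
print(single_line.rdd.getNumPartitions())

# multiLine=True parses the whole file as one JSON document/array, which makes
# it non-splittable -- typically a single task ends up reading the entire file.
multi_line = spark.read.option("multiLine", True).json("/data/orders_multiline.json")
print(multi_line.rdd.getNumPartitions())
```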

Pyspark

3 min read



Published in Dev Genius · Mar 9

PySpark — Worst use of Window Functions

It is very important to understand how Apache Spark distributes data. Very often we don’t realize its importance and make mistakes without even knowing. One such mistake is using row_number() to generate surrogate keys for dimensions. Surrogate keys are synthetic/artificial keys that are generated…
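A minimal sketch of the anti-pattern and one commonly suggested alternative (monotonically_increasing_id(), which gives unique but non-consecutive ids); the path and columns are illustrative.

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("surrogate-keys").getOrCreate()
dim_customers = spark.read.parquet("/data/dim_customers")  # illustrative path

# Anti-pattern: a window with no partitionBy pulls every row onto ONE partition
# on a single executor before numbering -- it does not scale.
w = Window.orderBy("customer_id")
bad = dim_customers.withColumn("surrogate_key", F.row_number().over(w))

# One common alternative: generate unique (though not consecutive) ids in parallel.
better = dim_customers.withColumn("surrogate_key", F.monotonically_increasing_id())
```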

Spark

4 min read



Mar 3

PySpark — Data Frame Joins on Multiple Conditions

We often run into situations where we have to join two Spark DataFrames on multiple conditions, and those conditions can be complex and may change with requirements. We will work on a simple hack that makes our join conditions much more effective and simpler to use. …
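One common way to keep such conditions manageable (not necessarily the exact hack from the article) is to hold them in a list and fold them into a single expression; the paths and columns are illustrative.

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-condition-join").getOrCreate()

orders = spark.read.parquet("/data/orders")        # illustrative paths and columns
customers = spark.read.parquet("/data/customers")

# Keep the join conditions in a list so they can be extended or changed per
# requirement, then fold them into a single Column expression with '&'.
conditions = [
    orders["customer_id"] == customers["customer_id"],
    orders["region"] == customers["region"],
    orders["order_date"] >= customers["active_from"],
]
joined = orders.join(customers, on=reduce(lambda a, b: a & b, conditions), how="inner")
```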

Pyspark

4 min read



Feb 25

Data Lakehouse with PySpark — Batch Loading Strategy

In this article we are going to design the batch loading strategy used to load data into our Data Lakehouse for the series Data Lakehouse with PySpark. This is a well-known, reusable strategy that can be implemented as part of any Data Warehouse or Data Lakehouse project. If you are…
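A minimal sketch of one common batch-loading pattern (a watermark-driven incremental load with audit columns), offered only as an illustration since the article's exact strategy is not spelled out here; all paths, column names, and the watermark value are hypothetical.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("batch-load").getOrCreate()

# Hypothetical names: a staging source, a target lakehouse path, and a stored watermark
TARGET_PATH = "/lakehouse/silver/orders"
last_watermark = "2023-02-24 00:00:00"  # would normally come from a control/audit table

# Incremental batch: pick up only rows changed since the last successful run,
# stamp them with load metadata, and append them to the target table.
incremental = (
    spark.read.parquet("/lakehouse/bronze/orders")
    .where(F.col("updated_at") > F.lit(last_watermark))
    .withColumn("_load_ts", F.current_timestamp())
    .withColumn("_batch_id", F.lit("20230225_01"))
)
incremental.write.mode("append").partitionBy("order_date").parquet(TARGET_PATH)
```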

Spark

3 min read



Senior Data Engineer | YouTube: https://www.youtube.com/@easewithdata | TopMate: https://topmate.io/subham_khandelwal
