
Subham Khandelwal ✅

332 Followers


Pinned

EaseWithApacheSpark — PySpark Series

This series follows learning Apache Spark (PySpark) with quick tips and workarounds for everyday problems. Click the links below to follow: PySpark — Create Data Frame from Python List or Iterable, PySpark — Parse Spark Schema from datatype string, PySpark — Create Data Frame from API …

Spark

2 min read



Pinned

EaseWithData — Data Warehouse Series

If you are new to data warehousing or preparing for an interview, then this series is for you. This series follows YouTube videos explaining the basics of data warehousing in simple terms. Make sure to like and subscribe to our channel. Click the links below to follow: Introduction to…

Data Warehouse

1 min read



Published in Dev Genius · 4 days ago

PySpark — Estimate Partition Count for File Read

Understand how Spark estimates the number of partitions required to read a file — Spark reads files in partitions, and each partition is processed to reach the desired result. But we are often confused about how the number of partitions is estimated and divided. Today we will dive deep into the logic Spark uses for this calculation. But before we can…

Pyspark

5 min read
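
The preview only hints at the calculation, so here is a minimal sketch of how that estimate can be reproduced, assuming default settings; the CSV path and the ~700 MB file size are made-up values for illustration only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path, used only for illustration
df = spark.read.csv("/data/orders.csv", header=True)
print(df.rdd.getNumPartitions())                 # partitions Spark actually created for the read

# Rough mirror of Spark's split-size estimate for file sources (defaults shown)
max_partition_bytes = 128 * 1024 * 1024          # spark.sql.files.maxPartitionBytes
open_cost_in_bytes = 4 * 1024 * 1024             # spark.sql.files.openCostInBytes
default_parallelism = spark.sparkContext.defaultParallelism

file_size = 700 * 1024 * 1024                    # assumed ~700 MB input file
total_bytes = file_size + open_cost_in_bytes     # each file is padded by the open cost
bytes_per_core = total_bytes / default_parallelism
max_split_bytes = min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

print(int(total_bytes / max_split_bytes) + 1)    # approximate partition count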



Mar 18

PySpark — Optimize Huge File Read

We have all been in a scenario where we have to deal with huge file sizes and limited compute or resources. More often than not, we need to find ways to optimize such file reads/processing to make our data pipelines efficient. Today we are going to discuss one such Spark configuration…

Pyspark

5 min read
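
The preview does not name the exact configuration discussed, so the snippet below is only a sketch of one commonly tuned knob for large file reads, spark.sql.files.maxPartitionBytes; the 256 MB value and the file path are assumptions, not the article's recommendation.

from pyspark.sql import SparkSession

# Sketch: change the per-partition read size so a huge file is split into larger
# (fewer) or smaller (more) partitions, depending on the available compute
spark = (
    SparkSession.builder
    .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.parquet("/data/huge_orders.parquet")   # hypothetical path
print(df.rdd.getNumPartitions())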



Published in Dev Genius · Mar 13

PySpark — The Effects of Multiline

Optimizations are all around us; a few are hidden in small parameter changes and a few in the way we deal with data. Today, we will understand the effects of multiline files with Apache Spark. To explore the effect, we will read the same Orders JSON file with approximately 10,50,000…

Pyspark

3 min read
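
As a rough illustration of the effect being explored, the sketch below reads a line-delimited JSON file and a multiline one; the file paths are assumptions. With multiline enabled, Spark parses the file as a single JSON document, which generally leaves far fewer partitions to work with.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default: one JSON record per line, so the file can be split across partitions
orders_df = spark.read.json("/data/orders.json")

# multiline=true: the whole file is parsed as one JSON document/array
orders_multiline_df = (
    spark.read
    .option("multiline", "true")
    .json("/data/orders_multiline.json")
)

print(orders_df.rdd.getNumPartitions())
print(orders_multiline_df.rdd.getNumPartitions())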



Published in Dev Genius · Mar 9

PySpark — Worst use of Window Functions

It is very important to understand the core of data distribution in Apache Spark. Very often we don't realize its importance and make mistakes without even knowing. One such mistake is using row_number() to generate surrogate keys for dimensions. Surrogate keys are basically synthetic/artificial keys that are generated…

Spark

4 min read
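
To make the anti-pattern concrete, here is a minimal sketch; the dimension path and column name are assumptions. An un-partitioned window forces every row through a single task, while monotonically_increasing_id() (one possible alternative, not necessarily the article's) stays distributed at the cost of non-consecutive keys.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
dim_df = spark.read.parquet("/data/customer_dim.parquet")      # assumed input

# Anti-pattern: a window with no partitionBy pulls the whole dataset onto one task
w = Window.orderBy("customer_id")
bad_df = dim_df.withColumn("surrogate_key", F.row_number().over(w))

# A distributed alternative: unique (but non-consecutive) ids generated per partition
good_df = dim_df.withColumn("surrogate_key", F.monotonically_increasing_id())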



Mar 3

PySpark — Data Frame Joins on Multiple Conditions

We often run into situations where we have to join two Spark Data Frames on multiple conditions, and those conditions can be complex and may change as per the requirement. We will work on a simple hack that will make our join conditions much more effective and simpler to use. …

Pyspark

4 min read
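
The exact hack is cut off in the preview; purely as a sketch of one common approach, the join conditions can be kept in a Python list and folded into a single Column, so they stay easy to extend. The frames, paths and column names below are assumptions.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders_df = spark.read.parquet("/data/orders.parquet")         # hypothetical inputs
customers_df = spark.read.parquet("/data/customers.parquet")

# Keep the conditions in a list so they can grow or change per requirement
conditions = [
    orders_df["customer_id"] == customers_df["customer_id"],
    orders_df["region"] == customers_df["region"],
    orders_df["order_date"] >= customers_df["active_from"],
]

# Fold the list into one boolean Column and join once
joined_df = orders_df.join(customers_df, reduce(lambda a, b: a & b, conditions), "inner")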



Feb 25

Data Lakehouse with PySpark — Batch Loading Strategy

In this article we are going to design the batch loading strategy used to load data into our Data Lakehouse for the series Data Lakehouse with PySpark. This is a well-known and reusable strategy which can be implemented as part of any Data Warehouse or Data Lakehouse project. If you are…

Spark

3 min read
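
The strategy itself is not shown in the preview; the snippet below is only a sketch of one generic batch-load pattern (watermark filter plus append into a Delta table). The paths, table name, column names and timestamp are all made up, and a session with Delta Lake configured is assumed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

last_loaded_ts = "2023-02-24 00:00:00"           # would normally come from an audit/control table

incremental_df = (
    spark.read.parquet("/lake/landing/orders/")
    .where(F.col("updated_at") > F.lit(last_loaded_ts))
    .withColumn("load_ts", F.current_timestamp())
)

# Append-only load; a merge/upsert would be used where updates must be reconciled
incremental_df.write.format("delta").mode("append").saveAsTable("lakehouse.orders")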



Feb 21

Data Lakehouse with PySpark — Setup Delta Lake Warehouse on S3 and Boto3 with AWS

Next, as part of the series Data Lakehouse with PySpark, we need to set up boto3 and Delta Lake to communicate with AWS S3. This will help us create our default warehouse location for Delta Lake on AWS S3. We will also set up the metastore location for Delta Lake. To…

Spark

3 min read
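
As a rough sketch of the wiring described here (assuming the Delta Lake and hadoop-aws packages are on the classpath and AWS credentials are available through the standard chain; the bucket name is made up):

import boto3
from pyspark.sql import SparkSession

BUCKET = "my-lakehouse-bucket"                   # hypothetical bucket name

s3 = boto3.client("s3")
s3.create_bucket(Bucket=BUCKET)                  # skip if the bucket already exists

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.warehouse.dir", f"s3a://{BUCKET}/warehouse")   # default warehouse on S3
    .getOrCreate()
)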



Feb 10

Data Lakehouse with PySpark — Setup PySpark Docker Jupyter Lab env

As part of the series Data Lakehouse with PySpark, we need to set up the PySpark environment on Docker with Jupyter Lab. Today we are going to set it up in a few simple steps. This environment can further be used for your personal use cases and practice. If you are still…

Spark

3 min read
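
The Docker steps themselves are beyond this preview; as a small sanity check, the cell below can be run inside the Jupyter Lab container to confirm the PySpark environment works (the app name and local master are illustrative).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("env-check")
    .master("local[*]")
    .getOrCreate()
)

print(spark.version)
spark.range(5).show()        # tiny job to confirm tasks actually run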



⚒️ Senior Data Engineer with 10+ YOE | 📽️ YouTube channel: https://www.youtube.com/@easewithdata | 📞 TopMate : https://topmate.io/subham_khandelwal
