PySpark - Create a DataFrame from a List or RDD on the Fly

PySpark provides several convenient methods to create DataFrames on the fly from Python iterables such as lists, from RDDs, or from a simple range of values.

Method 1 — SparkSession range() method

# Create a DataFrame from a range of values
df_range_1 = spark.range(5)
df_range_1.show(5, truncate=False)
# You can optionally specify start, end and step as well
df_range_2 = spark.range(start=1, end=10, step=2)
df_range_2.show(10, False)

Method 2 — Spark createDataFrame() method

# Create a native Python list of data
_data = [
["1", "Ram"],
["2", "Shyam"],
["3", "Asraf"],
["4", None]
]
# Create the list of column names
_cols = ["id", "name"]
# Create a DataFrame using the createDataFrame method
df_users = spark.createDataFrame(data=_data, schema=_cols)
df_users.printSchema()
# Check Data Frame
df_users.show(truncate=False)

Method 3 — RDD toDF() method

# From the same data list, create a new RDD
_data_rdd = spark.sparkContext.parallelize(_data)
_data_rdd.collect()
# Check the number of partitions of the RDD
_data_rdd.getNumPartitions()
# Create a DataFrame from the RDD
df_users_new = _data_rdd.toDF(_cols)
df_users_new.show()
