PySpark - Create Data Frame from List or RDD on the fly
PySpark enables certain popular methods to create data frames on the fly from rdd, iterables such as Python List, RDD etc.
Method 1 — SparkSession range() method
# Create an Dataframe from range of values
df_range_1 = spark.range(5)
df_range_1.show(5, truncate = False)
# You can optionally specify start, end and steps as well
df_range_2 = spark.range(start = 1, end = 10, step = 2)
df_range_2.show(10, False)
Method 2 — Spark createDataFrame() method
# Create Python Native List of Data
_data = [
["1", "Ram"],
["2", "Shyam"],
["3", "Asraf"],
["4", None]
]# Create the list of column names
_cols = ["id", "name"]# Create Data Frame using the createDataFrame method
df_users = spark.createDataFrame(data = _data, schema=_cols)
df_users.printSchema()# Check Data Frame
df_users.show(truncate=False)
Method 3 — Spark toDF() method
# From the same data list we create new RDD
_data_rdd = spark.sparkContext.parallelize(_data)
_data_rdd.collect()# To check number of partitions of the data
_data_rdd.getNumPartitions()
# Create Data Frame from the rdd
df_users_new = _data_rdd.toDF(_cols)
df_users_new.show()
Link to iPython notebook on GitHub — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/1_create_df_on_fly.ipynb