PySpark — Delta Lake Integration using Manifest
Delta Lake lets other query engines, such as Presto and AWS Athena, read a Delta table with the help of a symlink manifest file.
To integrate with these engines, we have to generate a manifest file from the Delta table. The manifest lists the data files that make up the current version of the table, after all operations applied to it so far.
Consider the following Delta Lake table.
Now, if we create an external Athena table directly on the table location, we run into problems: data files from all versions of the table still exist in that location, so Athena would also read files that are no longer part of the current version.
So, to fix that, we will generate a manifest and point the Athena table at it, so that Athena reads only the data files the manifest lists.
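To make the difference concrete, here is a toy simulation in plain Python, with no Spark or Athena involved. It is only a sketch: the file names are invented, and the manifest is written by hand to mimic the one-path-per-line layout of a symlink manifest for an unpartitioned table.

```python
import tempfile
from pathlib import Path
from typing import List


def files_seen_by_directory_scan(table_dir: Path) -> List[str]:
    """Every Parquet file in the folder, regardless of table version."""
    return sorted(p.name for p in table_dir.glob("*.parquet"))


def files_seen_via_manifest(manifest: Path) -> List[str]:
    """Only the files the manifest lists (one path per line)."""
    lines = manifest.read_text().splitlines()
    return sorted(Path(line).name for line in lines if line.strip())


# Toy table folder; names and contents are hypothetical stand-ins.
table_dir = Path(tempfile.mkdtemp())
for name in ("part-0000-v0.parquet", "part-0000-v1.parquet", "part-0001-v1.parquet"):
    (table_dir / name).write_bytes(b"")

# Suppose version 1 rewrote part-0000: the v0 file is stale but still on disk.
manifest_dir = table_dir / "_symlink_format_manifest"
manifest_dir.mkdir()
manifest = manifest_dir / "manifest"
manifest.write_text(
    f"{table_dir / 'part-0000-v1.parquet'}\n"
    f"{table_dir / 'part-0001-v1.parquet'}\n"
)

all_files = files_seen_by_directory_scan(table_dir)
current_files = files_seen_via_manifest(manifest)
print(all_files)      # three files, including the stale v0 file
print(current_files)  # only the two files of the current version
```

A reader that scans the folder picks up the stale file as well, while a reader that follows the manifest sees only the current version.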
# Generate the symlink manifest for the delta table
# (dt is assumed to be a delta.tables.DeltaTable handle for the table)
dt.generate("symlink_format_manifest")
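The call above assumes a `dt` handle already exists. A minimal sketch of how it might be obtained, assuming the delta-spark package is installed and `spark` is a SparkSession configured for Delta Lake. The helper name `generate_symlink_manifest` and its `table_path` argument are made up for illustration:

```python
from typing import Any


def generate_symlink_manifest(spark: Any, table_path: str) -> None:
    """Write a symlink manifest for the Delta table stored at table_path."""
    # Imported lazily so this module still loads where delta-spark is absent.
    from delta.tables import DeltaTable

    # Load the existing Delta table by its storage path.
    dt = DeltaTable.forPath(spark, table_path)

    # Write manifest files listing the data files of the current version.
    dt.generate("symlink_format_manifest")
```

It could be called as `generate_symlink_manifest(spark, "s3://my-bucket/delta/sales")`, where the path is a placeholder.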
This command creates a new folder named _symlink_format_manifest inside the table's folder location and writes the manifest file there.
The manifest itself contains the paths of the data files that belong to the latest version of the table.
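The remaining step is to point the Athena table at the manifest. The sketch below builds the DDL as a Python string so the pieces are easy to see. The table name, columns, and bucket are hypothetical, and the helper `athena_ddl` is invented for illustration. The SerDe and input/output format class names are the standard Hive ones for Parquet files read through a symlink manifest.

```python
from typing import Dict


def athena_ddl(table: str, columns: Dict[str, str], table_location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement that reads via the manifest."""
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns.items())
    # Athena is pointed at the manifest folder, not at the table folder itself.
    manifest_dir = table_location.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{manifest_dir}'"
    )


# Hypothetical table name, schema, and bucket, chosen only for illustration.
ddl = athena_ddl("sales", {"id": "int", "amount": "double"}, "s3://my-bucket/delta/sales")
print(ddl)
```

The printed statement can be run in the Athena query editor. Since the manifest reflects the table as of the moment it was generated, it has to be regenerated after later writes for Athena to see them.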