@Zifah
Created February 9, 2019 21:26
Write data directly to an Azure blob storage container from an Azure Databricks notebook
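The script below assumes a few variables are defined earlier in the notebook: storage_name, sas_key, output_container_name and dataframe. A minimal, hypothetical setup (placeholder names and values, not part of the original gist) might look like this:

# Hypothetical setup assumed by the script below (placeholder values only)
storage_name = "mystorageaccount"      # Azure storage account name (hypothetical)
sas_key = dbutils.secrets.get(scope="my-scope", key="storage-key")  # account key or SAS token, ideally from a secret scope (hypothetical scope/key names)
output_container_name = "output"       # target blob container (hypothetical)
dataframe = spark.read.csv("/mnt/input/raw.csv", header=True)  # any Spark DataFrame to write out (hypothetical path)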
# Configure blob storage account access key globally
spark.conf.set(
    "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
    sas_key)

output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# Write the dataframe as a single CSV file to blob storage
(dataframe
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("com.databricks.spark.csv")
    .save(output_blob_folder))

# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]

# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container,
# renaming it in the process
dbutils.fs.mv(output_file[0].path, "%s/predict-transform-output.csv" % output_container_path)
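As a quick sanity check (not part of the original gist), the renamed file can be read straight back from the container root:

# Optional: read the renamed CSV back from the container root to confirm the move worked
check_df = spark.read.option("header", "true").csv("%s/predict-transform-output.csv" % output_container_path)
display(check_df.limit(5))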
@Sarvesh-CSE

This code creates a lot of temporary files, for instance _committed....., _started........... and _SUCCESS. How can I avoid this?

@Zifah (Author) commented Feb 21, 2020

I believe that was also the case when I ran the script myself. One thing I would suggest is to write an additional step that deletes the temporary files from the Azure blob container once the data frame has been written successfully. There should be a simple way to achieve that with another dbutils method.
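For example, a cleanup step along these lines could be added after the move (a sketch, not from the original gist; it reuses output_blob_folder from the script above):

# Sketch: once the part file has been moved out, delete the staging folder
# together with the _started_*, _committed_* and _SUCCESS marker files inside it
dbutils.fs.rm(output_blob_folder, True)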
