Skip to content

Instantly share code, notes, and snippets.

@jordangarrison
Last active March 14, 2025 16:59
Show Gist options
  • Save jordangarrison/3334afa699b2c0fa2f8e34d4296652a9 to your computer and use it in GitHub Desktop.
Save jordangarrison/3334afa699b2c0fa2f8e34d4296652a9 to your computer and use it in GitHub Desktop.
Download files in a spreadsheet to a specific directory on my machine

Key improvements and explanations:

  • Error Handling: Crucially includes robust error handling:
    • FileNotFoundError for the Excel file.
    • Checks for the existence of the required columns ('URL' and 'filename').
    • requests.exceptions.RequestException to catch network errors (connection errors, timeouts, etc.) during the download.
    • response.raise_for_status() to check for HTTP errors (4xx or 5xx status codes) and raise an exception if one occurs. This is very important for handling failed downloads.
    • General Exception catch-all for unexpected errors during file processing.
  • Command-Line Arguments: Uses argparse to handle command-line arguments for the Excel file path and output directory. This makes the script much more flexible and reusable.
  • Output Directory Creation: os.makedirs(output_dir, exist_ok=True) creates the output directory if it doesn't exist. exist_ok=True prevents an error if the directory already exists.
  • Chunked Download: Downloads files in chunks (response.iter_content(chunk_size=8192)) to handle large files efficiently and avoid loading the entire file into memory at once.
  • Clearer Output: Prints informative messages about downloaded files and any errors that occur.
  • Handles Missing Values: Checks for pd.isna(url) or pd.isna(filename) to skip rows with missing URLs or filenames, preventing crashes.
  • Filepath Construction: Uses os.path.join to construct filepaths correctly, ensuring cross-platform compatibility.
  • Comments and Readability: Well-commented and formatted for better understanding.
  • if __name__ == "__main__": block: Ensures that the download_files_from_excel function is only called when the script is run directly (not when it's imported as a module).
  • Excel Row Indexing: The error message for skipping rows now correctly indicates the Excel row number (which is 1-indexed).

How to use:

  1. Install Libraries:

    pip install pandas requests
  2. Save the script: Save the code as a Python file (e.g., download_script.py).

  3. Prepare your Excel file: Create an Excel file with columns named "URL" and "filename". The "URL" column should contain the URLs of the files you want to download, and the "filename" column should contain the desired filenames for the downloaded files.

  4. Run the script:

    python download_script.py your_excel_file.xlsx output_directory

    Replace your_excel_file.xlsx with the actual path to your Excel file and output_directory with the desired output directory.

Example Excel file (data.xlsx):

URL filename
https://www.easygifanimator.net/images/samples/video-to-gif-sample.gif sample.gif
https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf dummy.pdf
https://www.sample-videos.com/img/Sample-jpg-image-50kb.jpg image.jpg
missing_file.txt
https://www.example.com/doesnotexist nonexistent.txt

This improved version addresses potential errors, handles large files efficiently, and provides a more user-friendly experience. It's also more robust and reliable for real-world use.

import argparse
import pandas as pd
import requests
import os
from urllib.parse import urlparse
def download_files_from_excel(excel_file, output_dir):
"""
Reads an Excel file, downloads files from URLs in the 'URL' column,
and saves them with filenames from the 'filename' column into a specified directory.
Args:
excel_file (str): Path to the Excel file.
output_dir (str): Path to the output directory.
"""
try:
df = pd.read_excel(excel_file)
except FileNotFoundError:
print(f"Error: Excel file not found at {excel_file}")
return
except Exception as e:
print(f"Error reading Excel file: {e}")
return
if 'URL' not in df.columns or 'filename' not in df.columns:
print("Error: Excel file must contain 'URL' and 'filename' columns.")
return
# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
for index, row in df.iterrows():
url = row['URL']
filename = row['filename']
if pd.isna(url) or pd.isna(filename):
print(f"Skipping row {index + 2} due to missing URL or filename.") # Excel rows are 1-indexed
continue
filepath = os.path.join(output_dir, filename)
try:
response = requests.get(url, stream=True)
response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
with open(filepath, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
print(f"Downloaded: {filename} from {url} to {filepath}")
except requests.exceptions.RequestException as e:
print(f"Error downloading {filename} from {url}: {e}")
except Exception as e:
print(f"Error processing {filename}: {e}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Download files from URLs listed in an Excel file.")
parser.add_argument("excel_file", help="Path to the Excel file.")
parser.add_argument("output_dir", help="Path to the output directory.")
args = parser.parse_args()
download_files_from_excel(args.excel_file, args.output_dir)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment