Last active
March 1, 2018 08:16
Wget is a command-line utility that can retrieve all kinds of files over HTTP, HTTPS, and FTP. Because websites are served over HTTP and most web media files are accessible over HTTP or FTP, Wget is an excellent tool for ripping websites.
While Wget is typically used to download single files, it can also recursively download all pages and files that are reachable from an initial page:
wget -r -p https://www.makeuseof.com
However, some sites may detect and prevent what you’re trying to do because ripping a website can cost them a lot of bandwidth. To get around this, you can disguise yourself as a web browser with a user agent string:
wget -r -p -U Mozilla https://www.makeuseof.com
If you want to be polite, you should also limit your download speed (so you don’t hog the web server’s bandwidth) and pause between downloads (so you don’t overwhelm the web server with too many requests):
wget -r -p -U Mozilla --wait=10 --limit-rate=35K https://www.makeuseof.com
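If you run the polite variant often, it can help to wrap it in a small shell function. The following is only a sketch: the function name polite_mirror, its argument order, and the DRY_RUN switch are assumptions made up for illustration and are not part of Wget. The flags themselves are the ones used in the command above.

```shell
# polite_mirror URL [WAIT_SECONDS] [RATE]
# Hypothetical helper (not part of Wget): recursively download URL while
# pausing between requests and capping the download speed.
# Defaults mirror the command above: wait 10 seconds, limit to 35K.
polite_mirror() {
    url="$1"
    wait_secs="${2:-10}"
    rate="${3:-35K}"

    if [ -z "$url" ]; then
        echo "usage: polite_mirror URL [WAIT_SECONDS] [RATE]" >&2
        return 1
    fi

    # DRY_RUN is an illustrative switch: when set to 1, print the command
    # instead of running it, so you can check it before hitting the server.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "wget -r -p -U Mozilla --wait=$wait_secs --limit-rate=$rate $url"
    else
        wget -r -p -U Mozilla --wait="$wait_secs" --limit-rate="$rate" "$url"
    fi
}
```

For example, setting DRY_RUN=1 and then calling polite_mirror https://www.makeuseof.com 5 100K prints the command that would be run, with a 5-second wait and a 100K rate limit, without downloading anything.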
Wget comes bundled with most Unix-based systems. On a Mac, you can install it with a single Homebrew command: brew install wget. On Windows, you’ll need to use a ported version instead.
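To find out whether Wget is already installed before trying any of the commands above, you can check for it on your PATH. This is a minimal sketch for a POSIX-style shell; the helper name have_cmd is an assumption chosen for illustration.

```shell
# have_cmd NAME
# Hypothetical helper: succeeds if NAME is an available command.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd wget; then
    # Print only the first line of the version banner.
    wget --version | head -n 1
else
    echo "wget not found; install it first (for example, brew install wget on a Mac)"
fi
```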