disclaimer: the URL used is just an example. I picked the first one from /r/opendirectories that worked well enough.
Not sure if this post is ok, but I figured I'd share some tips for mass-downloading things from open directories.
I've used this for dirs with lots and lots of, say, mp3 files. It's easily customized and, better yet, fast!
If the server in question uses Apache or something similar (which thankfully most do), retrieving a textual representation of the index and storing it in a local file is usually the first step. That way the index doesn't have to be re-downloaded every time, which hugely improves speed.
I found lynx does the best job at this, because lynx comes with a feature that transforms the HTML tree into a textual representation:
lynx -dump http://ls.df.vc/pictures/ > listing.txt
results in this (for example):
Index of /pictures/
Name                     Last Modified         Size    Type
[1]Parent Directory/                           -       Directory
[2]avatar/               2009-Oct-10 21:35:43  -       Directory
[3]duckie/               2009-Jul-18 03:02:57  -       Directory
[ ... many more lines skipped ...]
lighttpd/1.4.35
References
1. http://ls.df.vc/
2. http://ls.df.vc/pictures/avatar/
3. http://ls.df.vc/pictures/duckie/
4. http://ls.df.vc/pictures/dump/
5. http://ls.df.vc/pictures/gif/
6. http://ls.df.vc/pictures/hayka/
7. http://ls.df.vc/pictures/mock_the_war/
8. http://ls.df.vc/pictures/record_store_gats/
9. http://ls.df.vc/pictures/404_gf_not_found.jpg
10. http://ls.df.vc/pictures/8bit_wedding.jpg
11. http://ls.df.vc/pictures/9_deadly_words_user_by_a_women.jpg
[... snibbedy snib ...]
The stuff we want begins after 'References'. We don't have to parse this manually; regular expressions have us covered. Here's a simple script that greps the URLs out of the listing and feeds the whole mess to wget, while also checking that we're not pointlessly re-downloading files we already have:
url="http://ls.df.vc/" | |
# we only want files ending in '.jpg' or '.png' for now | |
grep 'http://.*\.(jpg|png)' listing.txt -oh | while read url; do | |
filename="$(basename "$url" | urldecode)" | |
if [[ ! -f "$filename" ]]; then | |
wget -c "$url" | |
fi | |
done | |
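One thing to note: run the loop from the directory you want the files in, since both the existence check and wget use the current directory. Something like this (grab.sh is just whatever name you saved it under):
cd ~/pictures/mirror    # wherever the files should end up
bash ~/bin/grab.sh      # re-running it later only fetches what's still missing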
The urldecode script is nothing to write home about, but it's been working flawlessly so far (just put it in your ~/bin dir and make it executable):
#!/usr/bin/env perl
use strict;
use warnings;
use URI::Encode;

# decode percent-escapes (%20 and friends) on every line of stdin
my $uri = URI::Encode->new({ encode_reserved => 0 });
while (<>) {
    print $uri->decode($_);
}
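Quick sanity check from the shell (sample filename, obviously):
echo 'record%20store%20gats.jpg' | urldecode
# => record store gats.jpg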
However...
While the bash version works fine, it's really slow. Especially for large directories, a good chunk of the waiting is literally just bash doing its thing. This is mostly because bash has to fork/exec external programs (basename, urldecode, wget) all the time, and fork is a rather expensive call.
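If you want to see the fork overhead for yourself, here's a rough (and admittedly unscientific) comparison; the exact numbers will vary between machines:
# 1000 fork+execs of an external binary
time for i in {1..1000}; do basename "/some/path/file.jpg" >/dev/null; done

# the same result via parameter expansion -- builtins only, no forks
time for i in {1..1000}; do f="/some/path/file.jpg"; echo "${f##*/}" >/dev/null; done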
We can massively improve speed by using a different (better, maybe) language. I'm using Ruby, although you can do the same in Python, Perl, Lisp, even PHP if you feel that way. I don't judge.
So here's the Ruby script that does the same thing:
url = "http://ls.df.vc/" | |
File.open("listing.txt", "r") do |fh| | |
listing = fh.read | |
listing.scan(/\d\d?\d?\d?\d?\d?. (http:\/\/.*.(jpg|png))/).each do |m| | |
furl = m.shift | |
base = File.basename(furl) | |
filename = URI.unescape(base) | |
if not File.file?(filename) then | |
system("wget", "-c", furl) | |
end | |
end | |
end | |
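Usage is the same as the bash version -- generate listing.txt, then run the script from the target directory (grab.rb being whatever you named it):
lynx -dump http://ls.df.vc/pictures/ > listing.txt
ruby ~/bin/grab.rb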
By the way, you don't need to run Linux, Unix, et cetera for these scripts to work: for the bash variant, MinGW will work just fine. For the Ruby script I strongly suggest installing Cygwin -- it's easy, doesn't invade your %PATH%, and you get a proper POSIX environment on Windows. What more could you possibly ask for?
Also, I hope this post isn't messing with the subreddit rules. Just trying to help. If you have any questions, just go ahead! I'll answer them (if I can).