disclaimer: the URL used is just an example. I picked the first one from /r/opendirectories that worked well enough.
Not sure if this post is ok, but I figured I'd share some tips for mass-downloading things from open directories.
I've used this for dirs with lots and lots of, say, mp3 files. It's easily customized and, better yet, fast!
If the server in question uses Apache or something similar (which thankfully most do), retrieving a textual representation of the index and storing it in a local file is usually the first step. That way the index doesn't have to be re-downloaded every time, which hugely improves speed.
I found lynx does the best job at this, because lynx comes with a feature that transforms the HTML tree into a textual representation:
lynx -dump http://ls.df.vc/pictures/ > listing.txt
results in this (for example):
Index of /pictures/
Name                     Last Modified         Size    Type
[1]Parent Directory/                           -       Directory
[2]avatar/               2009-Oct-10 21:35:43  -       Directory
[3]duckie/               2009-Jul-18 03:02:57  -       Directory
[ ... many more lines skipped ...]
lighttpd/1.4.35
References
1. http://ls.df.vc/
2. http://ls.df.vc/pictures/avatar/
3. http://ls.df.vc/pictures/duckie/
4. http://ls.df.vc/pictures/dump/
5. http://ls.df.vc/pictures/gif/
6. http://ls.df.vc/pictures/hayka/
7. http://ls.df.vc/pictures/mock_the_war/
8. http://ls.df.vc/pictures/record_store_gats/
9. http://ls.df.vc/pictures/404_gf_not_found.jpg
10. http://ls.df.vc/pictures/8bit_wedding.jpg
11. http://ls.df.vc/pictures/9_deadly_words_user_by_a_women.jpg
[... snibbedy snib ...]
The stuff we want begins after 'References'. We don't have to parse this manually; regular expressions have us covered. Here's a simple script that greps the URLs out of the listing and feeds the whole mess to wget, while also checking that we're not pointlessly re-downloading files we already have:
url="http://ls.df.vc/" | |
# we only want files ending in '.jpg' or '.png' for now | |
grep 'http://.*\.(jpg|png)' listing.txt -oh | while read url; do | |
filename="$(basename "$url" | urldecode)" | |
if [[ ! -f "$filename" ]]; then | |
wget -c "$url" | |
fi | |
done | |
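One thing to note: run the loop from the directory you want the files in, since both the existence check and wget use the current directory. Something like this (grab.sh is just whatever name you saved it under):
cd ~/pictures/mirror    # wherever the files should end up
bash ~/bin/grab.sh      # re-running it later only fetches what's still missing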
The urldecode script is nothing to write home about, but it's been working flawlessly so far (just put it in your ~/bin dir and make it executable):
#!/usr/bin/env perl
use strict;
use warnings;
use URI::Encode;

# decode percent-escapes (%20 and friends) on every line of stdin
my $uri = URI::Encode->new({ encode_reserved => 0 });
while (<>) {
    print $uri->decode($_);
}
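Quick sanity check from the shell (sample filename, obviously):
echo 'record%20store%20gats.jpg' | urldecode
# => record store gats.jpg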
However...
While the bash version works fine, it's really slow. Especially for large directories, a good chunk of the waiting is literally just bash doing its thing. This is mostly because bash has to fork/exec external programs (basename, urldecode, wget) all the time, and fork is a rather expensive call.
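If you want to see the fork overhead for yourself, here's a rough (and admittedly unscientific) comparison; the exact numbers will vary between machines:
# 1000 fork+execs of an external binary
time for i in {1..1000}; do basename "/some/path/file.jpg" >/dev/null; done

# the same result via parameter expansion -- builtins only, no forks
time for i in {1..1000}; do f="/some/path/file.jpg"; echo "${f##*/}" >/dev/null; done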
We can massively improve speed by using a different (better, maybe) language. I'm using Ruby, although you can do the same in Python, Perl, Lisp, even PHP if you feel that way. I don't judge.
So here's the Ruby script that does the same thing:
url = "http://ls.df.vc/" | |
File.open("listing.txt", "r") do |fh| | |
listing = fh.read | |
listing.scan(/\d\d?\d?\d?\d?\d?. (http:\/\/.*.(jpg|png))/).each do |m| | |
furl = m.shift | |
base = File.basename(furl) | |
filename = URI.unescape(base) | |
if not File.file?(filename) then | |
system("wget", "-c", furl) | |
end | |
end | |
end | |
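Usage is the same as the bash version -- generate listing.txt, then run the script from the target directory (grab.rb being whatever you named it):
lynx -dump http://ls.df.vc/pictures/ > listing.txt
ruby ~/bin/grab.rb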
By the way, you don't need to run Linux, Unix, et cetera for these scripts to work: for the bash variant, MinGW will work just fine. For the Ruby script I strongly suggest installing Cygwin -- it's easy, doesn't invade your %PATH%, and you get a proper POSIX environment on Windows. What more could you possibly ask for?
Also, I hope this post isn't messing with the subreddit rules. Just trying to help. If you have any questions, just go ahead! I'll answer them (if I can).