@jcharvet
Created August 20, 2017 15:27
A Python script that parses an HTML page to collect the URLs it contains and downloads the files those URLs point to.
import re
import urllib.request

url = 'some url'  # placeholder: the page to scan for links
suffix = '.pdf'   # only links ending with this suffix are downloaded

# Fetch the page and decode it as text.
with urllib.request.urlopen(url) as fp:
    mystr = fp.read().decode('utf8')

# Collect every absolute http(s) URL found in the page source.
links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', mystr)

for link in links:
    # Skip anything that is not a file of the wanted type.
    if not link.endswith(suffix):
        continue
    file_name = link.split('/')[-1]
    with urllib.request.urlopen(link) as u, open(file_name, 'wb') as f:
        # Content-Length is optional, so the total size may be unknown.
        length = u.getheader('Content-Length')
        file_size = int(length) if length else None
        print("Downloading: %s Bytes: %s" % (file_name, file_size or 'unknown'))
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            if file_size:
                status = "%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            else:
                status = "%10d" % file_size_dl
            # The carriage return keeps the progress counter on one line.
            print(status, end='\r')
    print()
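
The script above finds links by running a regular expression over the raw page source, which misses relative links and can pick up stray characters next to a URL. As a possible alternative, the sketch below collects `href` attributes with the standard library's `html.parser` and resolves them against the page URL. The names `LinkCollector`, `collect_links`, and `file_name_from_url` are hypothetical helpers introduced for illustration, not part of the original gist, and the sketch assumes only `<a href>` links are of interest.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, unquote
import posixpath


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # urljoin turns relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url, suffix='.pdf'):
    """Return absolute links whose URL path ends with suffix (case-insensitive)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [link for link in parser.links
            if urlsplit(link).path.lower().endswith(suffix)]


def file_name_from_url(link, default='download'):
    """Derive a local file name from the URL path, ignoring any query string."""
    name = posixpath.basename(unquote(urlsplit(link).path))
    return name or default
```

Each link returned by `collect_links(mystr, url)` could then be passed to the same download loop, with `file_name_from_url(link)` replacing `link.split('/')[-1]` so that query strings and trailing slashes do not produce an unusable file name.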
appdirs==1.4.3
awsebcli==3.10.1
blessed==1.14.2
botocore==1.5.42
ccm==2.8.1
cement==2.8.2
colorama==0.3.7
configparser==3.5.0
Django==1.11.4
docker-py==1.7.2
dockerpty==0.4.1
docopt==0.6.2
docutils==0.13.1
hurry.filesize==0.9
jmespath==0.9.2
packaging==16.8
pathspec==0.5.0
pyparsing==2.2.0
python-dateutil==2.6.0
python3-wget==0.0.2b1
pytz==2017.2
PyYAML==3.12
recordtype==1.1
requests==2.9.1
semantic-version==2.5.0
six==1.10.0
tabulate==0.7.5
termcolor==1.1.0
urlparse2==1.1.1
wcwidth==0.1.7
websocket-client==0.40.0
wget==3.2