Created
October 2, 2016 17:12
"""
Following Links in Python
In this assignment you will write a Python program that expands on
http://www.pythonlearn.com/code/urllinks.py. The program will
use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a
tag that is in a particular position relative to the first name in the list, follow that link and repeat the process a
number of times and report the last name you find.
We provide two files for this assignment. One is a sample file where we give you the name for your testing and
the other is the actual data you need to process for the assignment.
- Sample problem: Start at http://python-data.dr-chuck.net/known_by_Fikret.html
  Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer
  is the last name that you retrieve.
  Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
  Last name in sequence: Anayah
- Actual problem: Start at: http://python-data.dr-chuck.net/known_by_Inaara.html
  Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The
  answer is the last name that you retrieve.
  Hint: The first character of the name of the last page that you will load is: R
Strategy
The web pages tweak the height between the links and hide the page after a few seconds to make it difficult for
you to do the assignment without writing a Python program. But frankly with a little effort and patience you can
overcome these attempts to make it a little harder to complete the assignment without writing a Python
program. But that is not the point. The point is to write a clever Python program to solve the program.
"""
# Python 2 script: relies on the print statement, urllib.urlopen and BeautifulSoup 3.
import urllib
from BeautifulSoup import BeautifulSoup

all_links = []
all_names = []
url_first_part = 'http://python-data.dr-chuck.net/known_by_'
url_last_part = '.html'
first_entry = 'Inaara'

for i in range(7):  # repeat the process 7 times
    url = url_first_part + first_entry + url_last_part
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # collect the href of every anchor tag on the page
    tags = soup('a')
    links = []
    for tag in tags:
        links.append(tag.get('href', None))
    url = links[17]  # position 18, counting from 1
    print url
    name = url[41:]  # drop the 41-character url_first_part prefix
    next_entry = name[:-5]  # drop the '.html' suffix
    all_names.append(next_entry)
    first_entry = next_entry
    all_links.append(url)

print all_names[-1]
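The slices `url[41:]` and `name[:-5]` in the script depend on the exact length of the URL prefix and of the `.html` suffix, so they break if either changes. A minimal sketch of a less brittle alternative, written for Python 3 with only the standard library; the helper name `name_from_url` and its `prefix` parameter are hypothetical and not part of the original gist:

```python
from urllib.parse import urlparse
import posixpath


def name_from_url(url, prefix="known_by_"):
    """Return the name embedded in a URL such as .../known_by_Fikret.html."""
    # Last path component, e.g. 'known_by_Fikret.html'
    filename = posixpath.basename(urlparse(url).path)
    # Split off the extension, e.g. ('known_by_Fikret', '.html')
    stem, _ext = posixpath.splitext(filename)
    # Remove the assumed 'known_by_' prefix when it is present
    if stem.startswith(prefix):
        stem = stem[len(prefix):]
    return stem


print(name_from_url("http://python-data.dr-chuck.net/known_by_Fikret.html"))
```

Parsing the path this way keeps working if the host name or the file extension changes, at the cost of a few more lines than the two slices.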
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
c = input('Enter count: ')
pos = input('Enter position: ')
print(url)
for i in range(int(c)):
    html = urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    # positions count from 1, list indices from 0
    url = tags[int(pos)-1].get('href', None)
    print(url)

# Start from the URL you are given and print it. On each pass, take the anchor tag at index pos - 1 and read the URL stored in its href attribute.
# Open that URL, print it, and repeat for the remaining passes (count passes in total).
# The last URL printed is the final page; the answer is the name in it.
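The steps described in the comments above can also be wrapped in a function so they can be tried without network access. This is only a sketch under stated assumptions: it swaps BeautifulSoup for the standard library's `html.parser`, assumes each `href` is a URL that can be fetched as-is (as on the assignment pages), and the names `AnchorCollector`, `fetch_page` and `follow_links` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def fetch_page(url):
    """Download a page and decode it as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(url, position, count, fetch=fetch_page):
    """Follow the link at 1-based `position`, `count` times; return all URLs visited."""
    visited = [url]
    for _ in range(count):
        parser = AnchorCollector()
        parser.feed(fetch(url))
        if position > len(parser.hrefs):
            raise IndexError("page %s has only %d links" % (url, len(parser.hrefs)))
        url = parser.hrefs[position - 1]  # positions start at 1, indices at 0
        visited.append(url)
    return visited


# Offline demonstration: made-up pages stand in for the network
pages = {
    "a.html": '<a href="x.html">X</a><a href="b.html">B</a>',
    "b.html": '<a href="y.html">Y</a><a href="c.html">C</a>',
    "c.html": '<a href="z.html">Z</a><a href="a.html">A</a>',
}
print(follow_links("a.html", position=2, count=2, fetch=pages.get))
```

Passing the fetch function as a parameter is a design choice made here for testability; calling `follow_links(start_url, 18, 7)` with the default `fetch_page` would hit the real site instead.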