@raphey
Last active March 15, 2023 18:44
Scraping articles from a news site using BeautifulSoup
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Scraping article data from a news site\n",
"\n",
"This notebook scrapes the headlines, authors, and teasers from the home page of a news site, in this case theatlantic.com. You can find more detailed tutorials [here](https://code.tutsplus.com/tutorials/scraping-webpages-in-python-with-beautiful-soup-search-and-dom-modification--cms-28276) and [here](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html).\n",
"\n",
"### Grab the webpage"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<!doctype html>\n",
"\n",
"<html class=\"no-js\" lang=\"en\" prefix=\"og: http://ogp.me/ns#\">\n",
"\n",
"<head data-template-set=\"html5-reset\" prefix=\"og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# article: http://ogp.me/ns/article#\">\n",
"\n",
" <meta charset=\"utf-8\">\n",
"\n",
" \n",
" <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\n",
" <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n",
"\n",
" <title>The Atlantic</title>\n",
"\n"
]
}
],
"source": [
"import requests\n",
"\n",
"# Fetch the web page\n",
"r = requests.get(\"https://www.theatlantic.com\")\n",
"\n",
"# Print a portion\n",
"print(r.text[8:425])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parse with BeautifulSoup"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Feed page into BeautifulSoup\n",
"soup = BeautifulSoup(r.text, \"html5lib\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspecting the elements of the webpage, it looks like every article with a title and a summary is contained inside a pair of article tags. The classes of the article tags vary (c-cover-story, c-feature, etc.), but they all appear to share the same internal structure: a title inside a header with class=\"o-hed\", a teaser inside a paragraph with class=\"o-dek\", and an author inside a list item with class=\"o-meta__author\". There are some exceptions, covered further down. For now, we grab all the articles and take a closer look at one."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<article class=\"c-feature c-feature--sans-image\" data-omni-click=\"r'homepage',r'',d,r'homepage',r'0',r'543027'\">\n",
"\n",
" \n",
" <div class=\"c-feature__image\">\n",
" <figure class=\"o-media c-feature__media\">\n",
" <a class=\"o-media__object\" data-omni-click=\"inherit\" href=\"https://www.theatlantic.com/politics/archive/2017/10/microsoft-email-warrant-case/543027/\">\n",
" <picture>\n",
" <source media=\"(min-width: 576px)\" srcset=\"https://cdn.theatlantic.com/assets/media/img/mt/2017/10/RTX35Z6K/464x310.jpg?mod=1508175548\"/>\n",
" <img alt=\"The Microsoft logo reflected in a window\" class=\"o-media__img\" srcset=\"https://cdn.theatlantic.com/assets/media/img/mt/2017/10/RTX35Z6K/576x324.jpg?mod=1508175548\"/>\n",
" </picture>\n",
" </a>\n",
" </figure>\n",
" </div>\n",
" \n",
"\n",
" <div class=\"c-feature__content\">\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" <h2 class=\"o-hed c-feature__hed\">\n",
" <a class=\"c-feature__hed-link\" data-omni-click=\"inherit\" href=\"https://www.theatlantic.com/politics/archive/2017/10/microsoft-email-warrant-case/543027/\">\n",
" Should Federal Prosecutors Be Able to Search Americans' Emails Overseas?\n",
" </a>\n",
" </h2>\n",
" \n",
"\n",
" \n",
" <p class=\"o-dek c-feature__dek\">The Supreme Court will resolve a standoff between Microsoft and federal prosecutors who want access to customer data stored in Ireland.</p>\n",
" \n",
"\n",
" \n",
" <div class=\"o-meta o-meta--small c-feature__meta\">\n",
" <ul class=\"o-meta__byline\">\n",
" \n",
" <li class=\"o-meta__author\">\n",
" <a data-omni-click=\"r'homepage',d,r'author',@href\" href=\"https://www.theatlantic.com/author/matt-ford/\">Matt Ford</a>\n",
" </li>\n",
" \n",
" </ul>\n",
" </div>\n",
" \n",
"\n",
" </div>\n",
"\n",
"</article>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Find everything inside <article> tags\n",
"articles = soup.find_all(\"article\")\n",
"\n",
"# Look at one of them\n",
"articles[6]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract data\n",
"\n",
"Try pulling out the title, the description, and the author for one of the articles."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"\"Should Federal Prosecutors Be Able to Search Americans' Emails Overseas?\""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract title; we need to allow for varying header levels (h1, h2, or h3)\n",
"\n",
"articles[6].select_one(\"h1 a, h2 a, h3 a\").get_text().strip()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
"'The Supreme Court will resolve a standoff between Microsoft and federal prosecutors who want access to customer data stored in Ireland.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract description (the \".o-dek\" is probably unnecessary, since this presumably is the only <p> element)\n",
"articles[6].select_one(\"p.o-dek\").get_text().strip()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
"'Matt Ford'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract author\n",
"articles[6].select_one(\"li.o-meta__author\").get_text().strip()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we go through the list of articles and collect the title, teaser, and author of each. One wrinkle alluded to earlier is that one class of articles, c-story-strip, is missing teaser descriptions, so we'll only include the three article classes that we know work: c-cover-story, c-feature, and c-tease. Another twist is that some articles don't have authors, so we'll write some extra code to handle that."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
"('A Tax Proposal That Could Lift Millions Out of Poverty',\n",
" 'The Earned Income Tax Credit is one of the country’s most effective anti-poverty policies, but it mostly leaves out a huge segment of workers: those without children.',\n",
" 'Gene B. Sperling')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Go through all articles, extract title, teaser, and author\n",
"article_data = []\n",
"article_classes = [\"c-cover-story\", \"c-feature\", \"c-tease\"]\n",
"\n",
"for article in articles:\n",
"    # Skip articles whose classes aren't ones we want\n",
"    # (.get avoids a KeyError if an <article> has no class attribute)\n",
"    if not any(ac in article.get(\"class\", []) for ac in article_classes):\n",
"        continue\n",
"\n",
"    title = article.select_one(\"h1 a, h2 a, h3 a\").get_text().strip()\n",
"\n",
"    teaser = article.select_one(\"p.o-dek\").get_text().strip()\n",
"\n",
"    # Some articles have no author, so fall back to an empty string\n",
"    author_grab = article.select_one(\"li.o-meta__author\")\n",
"    if author_grab is None:\n",
"        author = \"\"\n",
"    else:\n",
"        author = author_grab.get_text().strip()\n",
"\n",
"    article_data.append((title, teaser, author))\n",
"\n",
"# Example of article_data\n",
"article_data[6]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print articles"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** The Great Undoer ***\n",
"Donald Trump’s biggest political wins have come in dismantling existing policies, but constructive and proactive steps have eluded him.\n",
"- David A. Graham\n",
"\n",
"*** Will Northern California Soon Have Southern California's Climate? ***\n",
"“The comparison is not a bad one.”\n",
"- Robinson Meyer\n",
"\n",
"*** Should Federal Prosecutors Be Able to Search Americans' Emails Overseas? ***\n",
"The Supreme Court will resolve a standoff between Microsoft and federal prosecutors who want access to customer data stored in Ireland.\n",
"- Matt Ford\n",
"\n",
"*** Google Maps' Failed Attempt to Get People to Lose Weight ***\n",
"Not everyone wants to be reminded of calories when trying to get directions.\n",
"- James Hamblin\n",
"\n",
"*** The Trumps and the Gospel of Winning ***\n",
"Ivana Trump’s new book is a parenting memoir—and a revealing exploration of presidential family values.\n",
"- Megan Garber\n",
"\n",
"*** Imagining the Future Is Just Another Form of Memory ***\n",
"Humans’ ability to predict the future is all thanks to our ability to remember the past.\n",
"- Julie Beck\n",
"\n",
"*** A Tax Proposal That Could Lift Millions Out of Poverty ***\n",
"The Earned Income Tax Credit is one of the country’s most effective anti-poverty policies, but it mostly leaves out a huge segment of workers: those without children.\n",
"- Gene B. Sperling\n",
"\n",
"*** The Quixotic Effort to Get a Better Brexit Deal ***\n",
"Some British lawmakers want the power to stop the U.K. from leaving the EU without a trade agreement—but it may not be so simple.\n",
"- Yasmeen Serhan\n",
"\n",
"*** The Threat of Polio in the Badlands of Boko Haram ***\n",
"The last vestiges of the devastating virus persist where terrorists stop vaccines from reaching babies and children.\n",
"- Jo Chandler\n",
"\n",
"*** The War on ISIS Held the Middle East Together ***\n",
"“With the fall of Raqqa, the sad story will pick up exactly where it left off in 2014.”\n",
"- Thanassis Cambanis\n",
"\n",
"*** Will Northern California Soon Have Southern California's Climate? ***\n",
"The Napa Valley wildfires are eerily similar to those that often flare up near Los Angeles.\n",
"- Robinson Meyer\n",
"\n",
"*** Kyrie Irving, the NBA’s Singular Star ***\n",
"While players around the league team up to chase the Warriors, the Celtics’ new point guard looks for a heavier burden.\n",
"- Robert O'Connell\n",
"\n",
"*** Trump's Nominee for Drug Czar Is Out ***\n",
"Representative Tom Marino withdrew from consideration after a recent news report detailed how a law he helped pass benefited narcotics distributors.\n",
"- Russell Berman\n",
"\n",
"*** The Great Undoer ***\n",
"Donald Trump’s biggest political wins have come in dismantling existing policies, but constructive and proactive steps have eluded him.\n",
"- David A. Graham\n",
"\n",
"*** What Hollywood Forgets About LBJ ***\n",
"Even as they stress his civil-rights legacy, popular portrayals ignore the issue that loomed largest over Lyndon B. Johnson's presidency: the Vietnam War.\n",
"- Julian E. Zelizer\n",
"\n",
"*** Negotiating With Al-Shabaab Will Get America Out of Somalia ***\n",
"“If this means more fire power, it will mean only more misery for the Somali people and their regional neighbors.”\n",
"- Helen C. Epstein\n",
"\n",
"*** The Atlantic Daily: Clashes and Crashes ***\n",
"Trump and McConnell make amends, Harvey Weinstein is expelled from the Academy, two neutron stars collide, and more.\n",
"- Rosa Inocencio Smith\n",
"\n",
"*** The Atlantic Politics & Policy Daily: Best Frenemies ***\n",
"During a news conference, President Trump said he and Senate Majority Leader Mitch McConnell are “closer than ever before.”\n",
"- Elaine Godfrey\n",
"\n",
"*** Why Trump Accused Obama of Not Consoling Families of Fallen Soldiers ***\n",
"The president touched off a brief firestorm with the unfounded charge, but real answers about why four service members were killed in Niger remain elusive.\n",
"- David A. Graham\n",
"\n",
"*** The Battles After ISIS ***\n",
"Iraqi forces face off against the Kurds in a potential harbinger of conflicts to come.\n",
"- Krishnadev Calamur\n",
"\n",
"*** How to Get More People to Ride the Bus ***\n",
"A driver, a transportation official, and a transit advocate explain why Seattle recently saw one of the biggest citywide increases in passenger numbers.\n",
"- Andrew Small\n",
"\n",
"*** A Remarriage of Convenience Between Donald Trump and Mitch McConnell ***\n",
"The president and the Senate majority leader make nice at the White House after a summer of bickering and ahead of a crucial period for the Republican agenda.\n",
"- Russell Berman\n",
"\n",
"*** 'Casting Couch': The Origins of a Pernicious Hollywood Cliché ***\n",
"How a seemingly innocuous phrase became a metonym for the skewed sexual politics of show business\n",
"- Ben Zimmer\n",
"\n",
"*** CityLab ***\n",
"The Atlantic, The Aspen Institute and Bloomberg Philanthropies will convene mayors and city practitioners from across the world for conversations on the future of cities.\n",
"\n",
"*** Derek Thompson and the Moonshot Factory ***\n",
"Inside the secretive lab where Google's parent company is researching advanced technology\n",
"\n"
]
}
],
"source": [
"for title, teaser, author in article_data:\n",
"    print(\"*** \" + title + \" ***\")\n",
"    print(teaser)\n",
"    if author:\n",
"        print(\"- \" + author)\n",
"    print()"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
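
The class filter and the optional-author handling used in the notebook can be tried on plain Python data, without fetching the page. What follows is a minimal sketch rather than the notebook's own code: the helper names `has_wanted_class` and `format_article` are hypothetical, and the class list passed in stands in for the list BeautifulSoup returns for a tag's `class` attribute.

```python
# Article classes the notebook keeps; c-story-strip is left out because it has no teaser.
ARTICLE_CLASSES = {"c-cover-story", "c-feature", "c-tease"}


def has_wanted_class(classes, wanted=ARTICLE_CLASSES):
    """Return True if any class in `classes` is one we want.

    `classes` is a list of class names, or None when the tag has no class attribute.
    """
    return any(c in wanted for c in (classes or []))


def format_article(title, teaser, author=""):
    """Build the text block printed for one article; the author line is optional."""
    lines = ["*** " + title + " ***", teaser]
    if author:
        lines.append("- " + author)
    return "\n".join(lines)


if __name__ == "__main__":
    # A feature article is kept, a story strip is skipped.
    print(has_wanted_class(["c-feature", "c-feature--sans-image"]))
    print(has_wanted_class(["c-story-strip"]))
    # An article without an author gets no "- author" line.
    print(format_article("CityLab", "Conversations on the future of cities."))
```

Keeping the filter and the formatting as small functions makes the two special cases (unwanted article classes and missing authors) easy to check in isolation before running them against the live page.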