4 Python Web Scraping Libraries To Mine News Data


Four easy-to-use open-source Python web scraping libraries to help you build your own news mining solution for your next NLP project, news aggregator, and more.


All libraries listed in this article work without any API or service: you can start using them straight away.




News data aggregation and extraction are performed by many companies on a commercial scale. For these use cases, various services exist, such as our own NewsCatcher News API. But if you are going for a DIY solution for a mini-project, you can use some ready-to-use Python libraries, or a combination of them, as news web scrapers that fetch news data with just a few lines of Python code.


For today's demonstration, we will look at some Python web scraping libraries that can help you mine news articles. Used on a small scale for a limited use case, they work well as news API alternatives:


PyGoogleNews


This library, created by the NewsCatcher team, acts as a Python wrapper for Google News, or an unofficial Google News API. PyGoogleNews is a web scraping library based on one simple trick: it exploits the lightweight Google News RSS feed.


GitHub link

To install run: pip install pygooglenews


In simple terms, it acts as a wrapper library for the Google News RSS feed, which you can easily install with pip and then import into your code.


What data points can it fetch for you?

  • Top stories
  • Topic-related news feeds
  • Geolocation specific news feed
  • An extensive query-based search feed


Calling gn.top_news() lets you extract data points from the top news articles in the Google News RSS feed. You can replace it with gn.topic_headlines('business') to get the top headlines related to business, or with gn.geo_headlines('San Fran') to get the top news in the San Francisco region.


You can also use complex queries such as “gn.search('boeing OR airbus')” to find news articles mentioning Boeing or Airbus or “gn.search('boeing -airbus')” to find all news articles that mention Boeing but not Airbus.


When web scraping news articles with this library, every news entry you capture includes the following data points, which you can use for data processing, for training your machine learning model, or for running NLP scripts:

  1. Title - the headline of the article
  2. Link - the original link to the article
  3. Published - the date on which it was published
  4. Summary - the article summary
  5. Source - the website on which it was published
  6. Sub-articles - a list of titles, publishers, and links on the same topic


We extracted just a few of the available data points, but you can extract the others as well, based on your requirements. Here’s a small example of the results produced by complex queries.


If you run the code below:

```python
from pygooglenews import GoogleNews

gn = GoogleNews()
s = gn.search('boeing OR airbus')
for entry in s["entries"]:
    print(entry["title"])
```


You will get the titles of the matching articles as output.


Each article in the results is about Boeing or Airbus. You can use the other querying options explained on the library's GitHub page to perform even more complicated queries on the latest news. This is what makes PyGoogleNews handy and easy to use, even for beginners.


NewsCatcher

This one is an open-source library created by our team that can be used in DIY projects. It's a simple Python web scraping library for scraping news articles from almost any news website. You can also use certain functions to gather details about a news website itself. Let's elaborate with examples and running code.


To install run: pip install newscatcher

GitHub link



If you want to grab the headlines from a news website, just create a Newscatcher object, passing it the website URL (remember to remove the http:// and the www and provide just the website name and extension), and use the get_headlines() function to obtain the top headlines from the website. If you run the code below:


```python
from newscatcher import Newscatcher

mm = Newscatcher(website='mediamatters.org')
for index, headline in enumerate(mm.get_headlines()):
    print(index, headline)
```



You will receive the site's top headlines in the output.


We have truncated the results here, but you can run the same code on your system to view all of them. If you want to view all the data points for a particular news article, you will have to take a different route.


The get_news() function fetches the top news from a website such as nytimes.com. While we extracted just a few of the data points, you can get all of them for further processing:

  • Title
  • Link
  • Authors
  • Tags
  • Date
  • Summary
  • Content
  • Link for Comments
  • Post_id


Running it returns the fields above as a JSON-like record. The tags can come in very handy if you want to sort through hundreds of news articles, or to store them in cloud storage in a format that can be used later in your NLP or ML projects.


While those were the tools for obtaining news information, you can also use the describe_url function to get details about the websites themselves. For example, we took three news URLs and obtained information about them:


```python
from newscatcher import describe_url

websites = ['nytimes.com', 'cronachediordinariorazzismo.org', 'libertaegiustizia.it']
for website in websites:
    print(describe_url(website))
```


We got data points such as URL, language, country, and topics for each of the websites we passed in the list.

The output identifies the second and third websites as being of Italian origin, and lists the topics for all three. Some data points, like the country, may not be available for every website, since many provide services worldwide.


Feedparser

This Python library runs on Python 3.6 or later and can be used to parse syndicated feeds. In short, it parses an RSS or Atom feed and hands you the information as structured data points. It acts as a news scraper that we can use to mine news data from the RSS feeds of different news websites.


To install run: pip install feedparser

GitHub link


By default, you would first need to find the RSS URL for feedparser to parse. In this article, however, we will use feedparser in conjunction with the feedsearch Python library, which finds RSS URLs by scraping the URL of a news website.



The approach is to first use feedsearch to find RSS links on the NYTimes website, then use feedparser to parse the first RSS feed it finds.

To install run: pip install feedsearch


If feedsearch cannot find the RSS feed of a website, there is a more advanced version with a crawler, called feedsearch-crawler.


Newspaper3k

Newspaper3k is a Python library for web scraping news articles by just passing the URL. Many of the libraries we saw before give us the content, but along with a lot of HTML tags and junk data. This library helps you fetch clean content, and a few more data points, from almost any newspaper article on the web.


This Python web scraping library can be combined with any of the libraries above to extract the full text body of the article.


To install run: pip install newspaper3k

GitHub link


For example, we ran the library on a recent NYTimes article. With both the text and the summary truncated as usual, you would get:

  • article text, free from any tags
  • authors
  • published date
  • thumbnail images for the article
  • videos if any attached to the article
  • keywords associated with the article
  • summary




Conclusion & Final Comparison

We created a simple comparison of all four Python web scraping libraries that can be used in DIY Python projects to create a content aggregator, to give a clear picture of their strengths and weaknesses.


PyGoogleNews

  • An alternative to Google News API
  • Fetches multiple data points for each news article
  • Keywords can be passed to find associated news
  • Complex queries with logical operators can be used


NewsCatcher

  • Can be used to get news data from multiple websites
  • Fetches multiple data points for each news article
  • You can filter news by topic, country, or language


Feedparser

  • Can be used to parse an RSS feed and obtain important information
  • Fetches multiple data points for the RSS feed passed


Newspaper3k

  • Helps extract all the data points from a news article link
  • Provides NLP-based results (keywords and a summary) in addition to the data points

