The ultimate guide to finding news articles published online, using open-source & free tools only.
Build a News API alternative with open-source tools only.
Why News API Alternative?
Data scientists and NLP enthusiasts love working with news data because there are many real-world use cases:
- topic clustering
- named entity recognition
- trend detection
- sentiment analysis
- trading on news
While building NewsCatcher News API, we discovered many open-source & free tools, services, and libraries that help you find & parse news articles published online.
We even published two Python packages that help work with news data:
- newscatcher - programmatically collect normalized news from (almost) any website.
- pygooglenews - your own Google News API alternative in Python.
While there are a few paid options (including what NewsCatcher does), some non-commercial use cases can be covered with open-source & free options, so you don't have to pay News API providers.
Who is this list for?
- Students/portfolio builders: Are you searching for a data science or data engineering job? You need to prove you can deliver results. Here's one example of a data engineering project from Damian Kliś. Off-topic comment: one thing that makes Damian's GitHub repository stand out is its clear and concise README. If you add a GitHub repository to your CV, explain well what is inside. A repo without a README might harm you more than help you find a job: it shows you don't care about documenting & illustrating your work. That's a red flag for people who'd consider hiring you.
- Side-project: This list might inspire your next side project. For example, you could build another news aggregator.
- Indie hackers: Don't have money to pay news data providers? Try building your own tool.
1. GDELT 2.0 Global Knowledge Graph
GDELT analyses news articles published online. They apply Natural Language Processing to understand what news is being written worldwide. In addition, the GKG dataset allows you to find the links to newly published news articles.
- ~400,000 news articles/day
- updates every 15 minutes
- worldwide multi-language coverage
- five years of the news archive
- only the URL of each article: you have to scrape & extract the content yourself
- inconsistent delay and coverage
You might think that a list of URLs isn't much, but it's half the job done. For example, you could use the newspaper3k Python package to parse an article from its URL or its HTML.
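To make the "list of URLs" idea concrete, here is a minimal sketch of pulling article links out of raw GKG rows. It assumes the file is tab-separated, that the third column holds the source type and the fifth column holds the document identifier, and that source type `1` means an open-web page. Those column positions are assumptions and should be checked against the GDELT codebook. `extract_article_urls` and `parse_article` are hypothetical helper names. The second helper shows one way to hand already-downloaded HTML to newspaper3k.

```python
# Assumed GKG 2.x layout: tab-separated rows where column 3 (0-based index 2)
# is the source collection type and column 5 (index 4) is the document
# identifier. Verify both against the GDELT codebook before relying on them.
SOURCE_TYPE_COL = 2
DOC_ID_COL = 4
WEB_SOURCE = "1"  # assumed code for documents found on the open web


def extract_article_urls(gkg_text):
    """Return de-duplicated article URLs from raw GKG rows, in file order."""
    seen = set()
    urls = []
    for line in gkg_text.split("\n"):
        fields = line.split("\t")
        if len(fields) <= DOC_ID_COL:
            continue  # skip blank, truncated, or malformed rows
        url = fields[DOC_ID_COL].strip()
        if fields[SOURCE_TYPE_COL] != WEB_SOURCE or not url.startswith("http"):
            continue  # keep only rows that point at a web page
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def parse_article(url, html):
    """Extract title and body text with newspaper3k (pip install newspaper3k)."""
    from newspaper import Article  # imported lazily: third-party dependency

    article = Article(url)
    article.download(input_html=html)  # reuse HTML you already fetched
    article.parse()
    return {"title": article.title, "text": article.text}
```

Keeping the URL extraction free of third-party imports means you can run it on every 15-minute update and only pay the scraping cost for the links you actually care about.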
2. News Crawl by Common Crawl
Common Crawl crawls the web and open-sources every page it finds. It is a non-profit, so I highly encourage you to donate if you end up using its data.
In 2016, Common Crawl decided to decouple the news crawl from its primary dataset. News Crawl uses RSS & news sitemaps to find the news. This part of the crawl is open-sourced separately.
- ~600,000 news articles/day
- worldwide multi-language coverage
- a few years of history
- full HTML of a page
- updates multiple times a day
- you still need to parse the content from the HTML
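Since News Crawl gives you full HTML, the remaining work is content extraction. The sketch below is a deliberately naive illustration using only the standard library: it takes the `<title>` and the text of every `<p>` element. It assumes you have already pulled the HTML string out of the crawl files. `ArticleTextParser` and `extract_article` are hypothetical names. Real pages need something more robust, such as a dedicated extraction library.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collect the page title and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def _flush_paragraph(self):
        # Collapse whitespace and store the paragraph if it is not empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            if self._in_p:  # previous <p> was never closed
                self._flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._flush_paragraph()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def extract_article(html):
    """Return a dict with the title and paragraph text of an HTML page."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    if parser._in_p:  # document ended inside an open paragraph
        parser._flush_paragraph()
    return {"title": parser.title.strip(), "text": "\n".join(parser.paragraphs)}
```

This ignores navigation, ads, and comment sections only to the extent that they sit outside `<p>` tags, which is why it is a starting point and not a finished extractor.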
3. RSS Feeds
RSS feeds still exist. Our beta version relied solely on RSS feeds, and we covered that approach in a separate article.
- partially structured & contains some data points (title, published date)
- you have to find the RSS feed – it's not a trivial task when you need it at scale
- a news provider can turn off its RSS feed at any time
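The points above can be sketched with the standard library alone. The first helper looks for feeds that a page advertises through `<link rel="alternate">` tags, a common convention but not one every site follows. The second reads the title, link, and published date from RSS 2.0 items. `find_feeds` and `parse_rss` are hypothetical names, and Atom feeds would need different element names.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            self.feed_urls.append(attr["href"])


def find_feeds(html):
    """Return the feed URLs advertised in a page's HTML."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feed_urls


def parse_rss(xml_text):
    """Return title, link, and published date for each RSS 2.0 <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        try:
            published = parsedate_to_datetime(pub) if pub else None
        except (TypeError, ValueError):
            published = None  # feeds sometimes carry malformed dates
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": published,
        })
    return items
```

Treating a missing or unparsable date as `None` keeps one bad item from discarding the whole feed, which matters once you poll many feeds of varying quality.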
4. Google News (RSS) - Google News API Alternative
Google News is the biggest UI-first news aggregator.
Google News has an RSS feed for every UI page. The feed is lightweight, and you will not get blocked for accessing it many times a day.
We wrote a Python library that helps you parse any Google News RSS page. Even if you are not a Python person, you can use this repository as an unofficial Google News RSS documentation (there is no official one).
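As a rough illustration of what such a feed address looks like, the sketch below builds a search feed URL. Because there is no official documentation, the `/rss/search` path and the `hl`, `gl`, and `ceid` parameters are assumptions based on how these URLs appear in practice, and they may change without notice. `search_feed_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base path of the (undocumented) Google News RSS endpoints.
GOOGLE_NEWS_RSS = "https://news.google.com/rss"


def search_feed_url(query, lang="en", country="US"):
    """Build a Google News RSS search URL for a query, language, and country."""
    params = {
        "q": query,
        "hl": f"{lang}-{country}",   # interface language, e.g. en-US (assumed)
        "gl": country,               # country edition (assumed)
        "ceid": f"{country}:{lang}", # edition id, e.g. US:en (assumed)
    }
    return f"{GOOGLE_NEWS_RSS}/search?{urlencode(params)}"


print(search_feed_url("bitcoin price"))
```

The resulting address returns an ordinary RSS document, so any RSS parser can read it.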
This list is a good starting point if you'd like to experiment with news data.
Building your News API alternative may teach you more about the subject itself.
Plus, it's a great data engineering exercise.