Google News RSS search parameters. The missing documentation

TL;DR You can get a narrowly targeted Google News RSS feed of aggregated news: search by keyword, location, time range, topic, etc. You just need to know the syntax. Unfortunately, Google does not provide any official documentation, so we'll try to fill the gap.

We are open-sourcing a lot of our work and building our company in public. In this post, we would like to share all of our findings about the Google News RSS feed (which turned out to be much more useful than we initially thought).

Intro

About two months ago Reuters killed their RSS feed. Without notice.

I quickly wrote this article where I explained how you could partially fix it using a Google News RSS "hack".

It trended on Hacker News, so I believe RSS is not dead. In this post, I will write down all the Google News RSS syntax that I have figured out over the past few months.

Why Google News RSS?

1. To integrate it into your RSS feed reader

2. Web scraping, or maybe "smart web scraping". Google's RSS feed contains the same data as the Google News UI (except the thumbnail image); however:

  1. it is much easier to scrape
  2. the RSS page is super light
  3. you are less likely to get blocked for making many requests (at least not as quickly as with the UI)

If you want to know more about the Google News UI vs RSS comparison, read this article of mine.

Our main News API solution does not depend on Google News; still, we took some time to understand how Google News works...

4 types of Google News RSS

There are 4 main feeds that could be generated. Here are one-liners for each one:

1. Top headlines - get the latest trending news headlines for your country

Example URL: https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en

2. Headlines by topic - get the latest topic-oriented news headlines for your country

Example URL: https://news.google.com/rss/headlines/section/topic/TECHNOLOGY?hl=en-US&gl=US&ceid=US:en

3. Location headlines - get the latest location-oriented news headlines (city, state, country, etc)

Example URL: https://news.google.com/rss/headlines/section/geo/NY?hl=en-US&gl=US&ceid=US:en

4. News by your search criteria - use the full power of the most advanced search engine: search by keywords, websites, dates, or any of these combined.

Example URL: https://news.google.com/rss/search?q=intitle:AAPL+when:1h&hl=en-US&gl=US&ceid=US:en

Common things through all Google News RSS feed types

1. 100 articles max - no matter what you want to do, a single request to Google's RSS will not return more than 100 articles.

2. Country & language - not all countries & languages are supported. To see the available country & language combinations, look at the bottom left of the Google News UI

3. A Google News RSS URL always starts with https://news.google.com/rss

1. Top Headlines

Paste https://news.google.com/rss into your browser and you will be redirected to the main Google News feed for your country & language. If you are in the US, you will most likely end up at https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en

hl: language

gl: country

ceid: country:language (for example, US:en)

You can modify these to change the feed to your country and language.

And that is pretty much all you need to do to get the latest headlines in RSS.
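As a rough illustration of the three parameters above, here is a minimal Python sketch that assembles a top-headlines URL. The helper name top_headlines_url is our own invention, not part of any Google API, and the defaults simply mirror the US example.

```python
from urllib.parse import urlencode

BASE_URL = "https://news.google.com/rss"


def top_headlines_url(hl: str = "en-US", gl: str = "US", ceid: str = "US:en") -> str:
    """Build a top-headlines feed URL (hypothetical helper, not a Google API)."""
    params = {"hl": hl, "gl": gl, "ceid": ceid}
    # safe=":" keeps the colon in ceid readable instead of encoding it as %3A
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(top_headlines_url())
```

Pass the values for another supported country & language combination to get that edition's feed.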

2. Headlines By Topic

Accepted topics are:

  • WORLD
  • NATION
  • BUSINESS
  • TECHNOLOGY
  • ENTERTAINMENT
  • SCIENCE
  • SPORTS
  • HEALTH

For each allowed country+language combination you can get these topic-oriented feeds.

US-English BUSINESS topic example:

https://news.google.com/rss/headlines/section/topic/BUSINESS?hl=en-US&gl=US&ceid=US:en

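To break it down:

https://news.google.com/rss/headlines/section/topic/<TOPIC>?hl=<LANGUAGE>&gl=<COUNTRY>&ceid=<COUNTRY>:<LANGUAGE>

Just change the <TOPIC> part to any of the 8 allowed topics to get specialized feeds.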
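A small sketch of how the topic feeds could be generated in Python. The function name topic_feed_url and the validation step are our own additions; the list of topics is the one given above.

```python
from urllib.parse import urlencode

# The 8 topics listed above
TOPICS = {
    "WORLD", "NATION", "BUSINESS", "TECHNOLOGY",
    "ENTERTAINMENT", "SCIENCE", "SPORTS", "HEALTH",
}


def topic_feed_url(topic: str, hl: str = "en-US", gl: str = "US", ceid: str = "US:en") -> str:
    """Build a headlines-by-topic feed URL (hypothetical helper)."""
    topic = topic.upper()
    if topic not in TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    query = urlencode({"hl": hl, "gl": gl, "ceid": ceid}, safe=":")
    return f"https://news.google.com/rss/headlines/section/topic/{topic}?{query}"


print(topic_feed_url("business"))
```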

"Hidden" Topics

Yes, there are "hidden" topics. If you already tried to insert the "BUSINESS" url from the section above in your browser, you might have noticed that it is being forwarded to another URL:

https://news.google.com/rss/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGx6TVdZU0JXVnVMVlZUR2dKVlV5Z0FQAQ?hl=en-US&gl=US&ceid=US:en

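To break it down:

https://news.google.com/rss/topics/<TOPIC_HASH>?hl=<LANGUAGE>&gl=<COUNTRY>&ceid=<COUNTRY>:<LANGUAGE>

Apparently, this hash string (CAAqKggKIiRDQkFTRlFvSUwyMHZNRGx6TVdZU0JXVnVMVlZUR2dKVlV5Z0FQAQ) is what the BUSINESS topic is for Google News.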

Initially, my thought was that those 8 topics are "special" because they work for all country & language combinations while others do not. But what works for one language seems to work for all the others.

You can go to the UI version of Google News and start typing something into the search bar. If what you are searching for is available as a topic, you can just copy its topic hash and use it within RSS.

[Screenshot: searching for the US election topic in the Google News UI]

[Screenshot: the US election topic hash in the URL]

So, our US election oriented RSS URL will look like:

https://news.google.com/rss/topics/CAAqKAgKIiJDQkFTRXdvTkwyY3ZNVEZpZDJ0a2JtYzRjQklDWlc0b0FBUAE?hl=en-US&gl=US&ceid=US:en
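If you collect topic hashes often, copying them by hand gets tedious. The sketch below pulls the hash out of a topic URL and rebuilds an RSS URL for any locale. It assumes the hash is the path segment right after "topics", as in the URLs above; both function names are hypothetical.

```python
from urllib.parse import urlencode, urlparse


def topic_hash_from_url(url: str) -> str:
    """Return the path segment that follows 'topics' (assumed URL layout)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "topics" not in segments[:-1]:
        raise ValueError(f"no topic hash found in {url!r}")
    return segments[segments.index("topics") + 1]


def topic_hash_feed_url(topic_hash: str, hl: str = "en-US", gl: str = "US", ceid: str = "US:en") -> str:
    """Build an RSS URL for a topic hash and a country & language combination."""
    query = urlencode({"hl": hl, "gl": gl, "ceid": ceid}, safe=":")
    return f"https://news.google.com/rss/topics/{topic_hash}?{query}"


copied = (
    "https://news.google.com/rss/topics/"
    "CAAqKAgKIiJDQkFTRXdvTkwyY3ZNVEZpZDJ0a2JtYzRjQklDWlc0b0FBUAE"
    "?hl=en-US&gl=US&ceid=US:en"
)
print(topic_hash_feed_url(topic_hash_from_url(copied)))
```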

3. Location Headlines

Find news that talks about a specific place.

US-English New York example:

1. https://news.google.com/rss/headlines/section/geo/NY?hl=en-US&gl=US&ceid=US:en

2. https://news.google.com/rss/headlines/section/geo/New York?hl=en-US&gl=US&ceid=US:en

3. https://news.google.com/rss/headlines/section/geo/NewYork?hl=en-US&gl=US&ceid=US:en

All of the above 3 links will be redirected to https://news.google.com/rss/topics/CAAqIggKIhxDQkFTRHdvSkwyMHZNREpmTWpnMkVnSmxiaWdBUAE?hl=en-US&gl=US&ceid=US:en

So locations are topics too. However, Google will help you find the right one even when you're using RSS!

Once again, you may copy the topic hash string and use it for any country & language combination.
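A minimal sketch for building location feed URLs in Python. The helper name location_feed_url is our own; the only real work is percent-encoding the place name so that a value such as "New York" stays a valid URL.

```python
from urllib.parse import quote, urlencode


def location_feed_url(location: str, hl: str = "en-US", gl: str = "US", ceid: str = "US:en") -> str:
    """Build a location-headlines feed URL (hypothetical helper)."""
    # quote() turns the space in "New York" into %20;
    # safe="" also escapes any "/" inside the place name
    place = quote(location, safe="")
    query = urlencode({"hl": hl, "gl": gl, "ceid": ceid}, safe=":")
    return f"https://news.google.com/rss/headlines/section/geo/{place}?{query}"


for name in ("NY", "New York", "NewYork"):
    print(location_feed_url(name))
```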

4. Advanced Search

Everything up to this point was more or less known when we started our "investigation". This part took 90-95% of the time we spent figuring out what we could actually achieve with the Google News RSS feed.

In short, you can search for news indexed by Google's engine within RSS. It is a big deal because you can web scrape news links from Google by loading a 30KB RSS web page instead of a 1MB+ UI version of it.
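To show why the RSS version is so convenient to scrape, here is a sketch that extracts links with nothing but the Python standard library. The sample feed is hand-written for illustration and only uses standard RSS 2.0 fields (title, link, pubDate); a real Google News response may carry more. The names parse_items and fetch_feed are our own.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hand-written sample, not real Google News output
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jun 2020 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <pubDate>Mon, 01 Jun 2020 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss)  # accepts str or bytes
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


def fetch_feed(url: str, timeout: float = 10.0) -> bytes:
    """Download a feed; needs network access, so it is not called here."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


for entry in parse_items(SAMPLE_RSS):
    print(entry["published"], entry["title"], entry["link"])
```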

Let's start with a simple search. Let's say we want to read the latest articles about Elon Musk:

https://news.google.com/rss/search?q=Elon%20Musk&ceid=US:en&hl=en-US&gl=US

q=Elon%20Musk is the part we are interested in.
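In code, the encoding of the q parameter is better left to a library. A minimal Python sketch, with search_feed_url being a hypothetical helper of ours:

```python
from urllib.parse import quote, urlencode


def search_feed_url(query: str, hl: str = "en-US", gl: str = "US", ceid: str = "US:en") -> str:
    """Build a search feed URL (hypothetical helper)."""
    params = {"q": query, "ceid": ceid, "hl": hl, "gl": gl}
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use "+");
    # safe=":" keeps the colon in ceid and in operators such as when:1h
    return "https://news.google.com/rss/search?" + urlencode(params, quote_via=quote, safe=":")


print(search_feed_url("Elon Musk"))
```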

4.1. q parameter advanced options

1. Boolean OR Search [OR] - by default, Google News RSS puts AND between the terms in the q parameter, so Elon Musk is actually Elon AND Musk. If you want results where at least one of the terms matches, use the OR operator. For example, to search for articles that mention SpaceX or Boeing:

https://news.google.com/rss/search?q=SpaceX%20OR%20Boeing&ceid=US:en&hl=en-US&gl=US

q param: q=SpaceX%20OR%20Boeing (q=SpaceX OR Boeing)

2. Exact Match ("your exact match search") - use quotes to perform exact match querying. A must when working with the names of companies, people, and places.

3. Exclude Query Term [-]

"The exclude (-) query term restricts results for a particular search request to documents that do not contain a particular word or phrase. To use the exclude query term, you would preface the word or phrase to be excluded from the matching documents with "-" (a minus sign)."

4. Include Query Term [+]

"The include (+) query term specifies that a word or phrase must occur in all documents included in the search results. To use the include query term, you would preface the word or phrase that must be included in all search results with "+" (a plus sign).

The URL-escaped version of + (a plus sign) is %2B"
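The four options above can be sketched as tiny Python helpers. The helper names are our own, and the search terms other than SpaceX OR Boeing are made up for illustration. Note how quote() produces the %2B escape mentioned above.

```python
from urllib.parse import quote


def any_of(*terms: str) -> str:
    """Boolean OR: at least one of the terms must match."""
    return " OR ".join(terms)


def exact(phrase: str) -> str:
    """Exact match: wrap the phrase in double quotes."""
    return f'"{phrase}"'


def exclude(term: str) -> str:
    """Exclude: prefix the term with a minus sign."""
    return f"-{term}"


def include(term: str) -> str:
    """Include: prefix the term with a plus sign."""
    return f"+{term}"


queries = [
    any_of("SpaceX", "Boeing"),
    exact("Elon Musk"),
    f"Boeing {exclude('737')}",
    include("Tesla"),
]
for q in queries:
    # quote() percent-encodes spaces, double quotes and the plus sign
    print(f"{q}  ->  q={quote(q)}")
```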

4.2. Advanced search with a time range

I mentioned before that a Google News RSS page can return at most 100 results. If you want to scrape data for your project, you will probably need more than that. How? By splitting your query into smaller time ranges and iterating over them.

The before and after parameters allow you to search by date. Unfortunately, you can narrow down your search only by day (no time of day allowed). So, if more than 100 articles match your query on a single day, you will not be able to retrieve the rest this way.

For example, if we want to find articles about Boeing for the first of June, 2020:

https://news.google.com/rss/search?q=Boeing+after:2020-06-01+before:2020-06-02&ceid=US:en&hl=en-US&gl=US

The query part: q=Boeing+after:2020-06-01+before:2020-06-02

You can also use just one of the two to make an open-ended time search.
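Here is a sketch of the "iterate by time range" idea in Python: one query per day, each following the after:DAY before:NEXT_DAY pattern from the Boeing example. The function names are hypothetical, and the sketch only builds the URLs; fetching them (and pausing between requests) is left out.

```python
from datetime import date, timedelta
from urllib.parse import quote_plus


def daily_queries(keyword: str, start: date, end: date):
    """Yield one q value per day, from start up to (not including) end."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield f"{keyword} after:{day.isoformat()} before:{next_day.isoformat()}"
        day = next_day


def search_url(query: str) -> str:
    # quote_plus encodes spaces as "+"; safe=":" keeps after:/before: readable
    q = quote_plus(query, safe=":")
    return f"https://news.google.com/rss/search?q={q}&ceid=US:en&hl=en-US&gl=US"


for q in daily_queries("Boeing", date(2020, 6, 1), date(2020, 6, 4)):
    print(search_url(q))
```

Each URL can return up to 100 articles, so a month of daily windows gives you a much larger sample than a single request.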

The when parameter sets the time range for the publication date and time. I could not find any documentation regarding this option, but here is what I deduced:

  • h for hours (for me, it worked for up to 101h). when:12h will return only the articles that match the search criteria and were published in the last 12 hours
  • d for days
  • m for months (for me, it worked for up to 48m)

For example, all articles about Boeing for the past hour: https://news.google.com/rss/search?q=Boeing+when:1h&ceid=US:en&hl=en-US&gl=US
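A small Python sketch for building the when: term. The helper names are our own, and the unit check only reflects the three units listed above; since the upper limits are undocumented, the sketch does not enforce any.

```python
from urllib.parse import quote_plus

# Units observed to work, per the list above
WHEN_UNITS = {"h": "hours", "d": "days", "m": "months"}


def when_term(amount: int, unit: str) -> str:
    """Return a when: term such as when:12h (hypothetical helper)."""
    if unit not in WHEN_UNITS:
        raise ValueError(f"unit must be one of {sorted(WHEN_UNITS)}")
    if amount < 1:
        raise ValueError("amount must be a positive integer")
    return f"when:{amount}{unit}"


def recent_query(keyword: str, amount: int, unit: str) -> str:
    """Return the encoded q value for 'keyword, published within the period'."""
    return quote_plus(f"{keyword} {when_term(amount, unit)}", safe=":")


print("q=" + recent_query("Boeing", 1, "h"))
```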

4.3. Not just the q parameter

allintext

"The allintext: query term requires each document in the search results to contain all of the words in the search query in the body of the document. The query should be formatted as allintext: followed by the words in your search query.

If your search query includes the allintext: query term, Google will only check the body text of documents for the words in your search query, ignoring links in those documents, document titles and document URLs."

intitle

"The intitle: query term restricts search results to documents that contain a particular word in the document title. The search query should be formatted as intitle:WORD with no space between the intitle: query term and the following word."

allintitle

"The allintitle: query term restricts search results to documents that contain all of the query words in the document title. To use the allintitle: query term, include "allintitle:" at the start of your search query.

Note: Putting allintitle: at the beginning of a search query is equivalent to putting intitle: in front of each word in the search query."

inurl

"The inurl: query term restricts search results to documents that contain a particular word in the document URL. The search query should be formatted as inurl:WORD with no space between the inurl: query term and the following word."

allinurl

"The allinurl: query term restricts search results to documents that contain all of the query words in the document URL. To use the allinurl: query term, include allinurl: at the start of your search query."

Here is a nice example of how you can use this parameter to get the news from Reuters (or any other news website which does not support RSS)
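To make these query terms concrete, here is a short Python sketch. The per_word helper (our own name) expands a phrase into one intitle: or inurl: term per word, which the note above describes as equivalent to allintitle:. The reuters.com query at the end is only an illustration of the pattern, an assumption on our part rather than a tested recipe.

```python
from urllib.parse import quote_plus


def per_word(operator: str, words: str) -> str:
    """Prefix every word with the operator, e.g. intitle:Boeing intitle:737."""
    return " ".join(f"{operator}:{word}" for word in words.split())


def search_url(query: str) -> str:
    # safe=":" keeps the operator colons readable in the final URL
    q = quote_plus(query, safe=":")
    return f"https://news.google.com/rss/search?q={q}&ceid=US:en&hl=en-US&gl=US"


title_query = per_word("intitle", "Boeing 737")
site_query = "allinurl:reuters.com when:24h"  # assumed pattern, for illustration

print(search_url(title_query))
print(search_url(site_query))
```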

______

Let me know if there is something I missed at artem [at] newscatcherapi.com

______

We're the NewsCatcherAPI team: we're building the best news search engine API. Our News API is used by over 350 users for:

  • Algotrading
  • PR & Media Monitoring
  • Custom News Aggregators
  • Market Research & Analysis
  • ML projects