How NewsCatcher works
Learn what it takes to turn news published online into the articles returned in your API response
News API
Our core product is a JSON News API: we enable you to find relevant news articles based on your query parameters, such as keyword, language, country, news website, etc.
To return only relevant results from the more than one million news articles that we index daily, the API queries our ElasticSearch clusters.
ElasticSearch is a powerful tool for indexing and searching text data.
But the REST API itself is just the tip of the iceberg: the bulk of the work lies in bringing structured news articles into the ElasticSearch cluster.
Part I: Web crawling (How we find news data)
News gets published online every second. To keep up, we have to constantly monitor news websites and check for updates. So we crawl the web. We don’t crawl the entire web like Google does, just the news parts of it.
We have a list of over 60,000 news websites that produce new articles (new data points).
Our web crawlers constantly find new web pages. The next step is to check whether these pages have already been processed. Then, our algorithms decide whether each remaining page is a news article page or not.
At the end of this process, we have a stream of HTML pages of news articles that were recently published online.
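The "already processed" check can be sketched as follows. This is a minimal illustration, not the production crawler: the `normalize_url` rules and the in-memory `SeenPages` class are assumptions made for the example, and the step that classifies a page as a news article is not covered.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenPages:
    """In-memory record of pages that were already processed (hypothetical)."""

    def __init__(self):
        self._fingerprints = set()

    def is_new(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        normalized = normalize_url(url)
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True
```

Storing a fixed-size fingerprint instead of the full URL keeps memory use predictable. At real crawl volumes the set would more likely live in an external store than in process memory.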
Part II: Data extraction (How we turn unstructured text into structured data)
Now, we have to turn each HTML page into structured data fields, such as title, article body, and published time.
This is the most advanced and complicated part of what we do, as we have to write a generic news parser: it has to extract data fields from any news website in any language.
Another, more obvious option would be to write a dedicated web scraper for each news website. But that approach does not scale when you have ~100,000 news websites to get news articles from.
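To give a feel for what "generic" means here, the sketch below reads two fields from metadata that many news sites expose in the same way, instead of relying on site-specific selectors. It assumes the page carries Open Graph style `og:title` and `article:published_time` meta tags and falls back to the `<title>` element. It covers only one heuristic and does not attempt article body extraction.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta> property/name values from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attributes.get("property") or attributes.get("name")
            content = attributes.get("content")
            if key and content:
                self.meta[key.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_fields(html):
    """Return title and published time, preferring Open Graph metadata."""
    parser = MetaExtractor()
    parser.feed(html)
    return {
        "title": parser.meta.get("og:title") or parser.title_text.strip() or None,
        "published_time": parser.meta.get("article:published_time"),
    }
```

Because nothing in the function is tied to a particular website, the same code runs against any page. The trade-off is that every field needs several such fallbacks before coverage becomes acceptable.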
Part III: Data enrichment
After we parse the fields from the article HTML, we have to enrich them with more data points:
- article language
- publisher’s country
- publisher’s page rank
- publisher’s topic
- whether this article is “Opinion” or not
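A simplified view of the enrichment step is sketched below. The `PUBLISHERS` table, the output field names, and the URL-based opinion check are all assumptions made for illustration. Language detection is left out because it normally requires a trained model or a dedicated library.

```python
from urllib.parse import urlsplit

# Hypothetical publisher table; real values would come from a maintained dataset.
PUBLISHERS = {
    "example-news.com": {"country": "US", "rank": 1200, "topic": "business"},
}


def enrich(article):
    """Return a copy of the parsed article with publisher-level fields added."""
    parts = urlsplit(article["url"])
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    publisher = PUBLISHERS.get(host, {})

    enriched = dict(article)
    enriched["country"] = publisher.get("country")
    enriched["rank"] = publisher.get("rank")
    enriched["topic"] = publisher.get("topic")
    # Naive opinion check: some sites publish such pieces under an /opinion/ path.
    enriched["is_opinion"] = "/opinion/" in parts.path.lower()
    return enriched
```

Publisher-level fields are looked up once per domain, while the opinion flag has to be decided per article, which is why the two are handled differently.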
Part IV: Indexing
Now, each news article is structured as a JSON document that combines the parsed and enriched fields.
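As a rough illustration of such a document, the sketch below models one article and serializes it to JSON. The class name and every field name are assumptions chosen to mirror the fields mentioned in Parts II and III. They are not the actual NewsCatcher schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NewsArticle:
    # Parsed fields (Part II)
    title: str
    body: str
    published_time: str
    url: str
    # Enriched fields (Part III)
    language: Optional[str] = None
    country: Optional[str] = None
    rank: Optional[int] = None
    topic: Optional[str] = None
    is_opinion: bool = False


def to_json(article):
    """Serialize an article to a JSON string, keeping non-ASCII text readable."""
    return json.dumps(asdict(article), ensure_ascii=False)
```

For example, `to_json(NewsArticle(title="Headline", body="Text", published_time="2021-03-01T08:00:00Z", url="https://example.com/a"))` yields a flat JSON object with the enriched fields set to their defaults.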
We may have over 1.5M such articles daily, so we need an efficient tool to index this data.
We use ElasticSearch for this purpose. It enables our News API to respond to your advanced queries in less than 300 ms.
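At this volume, documents are usually sent to the cluster in batches rather than one at a time. The sketch below builds the newline-delimited JSON body that the `_bulk` endpoint accepts: one action line followed by one source line per document. The index name `news-articles` and the use of the article URL as the document ID are assumptions for the example.

```python
import json


def build_bulk_body(articles, index_name="news-articles"):
    """Serialize articles into an NDJSON body for a bulk indexing request."""
    lines = []
    for article in articles:
        # Action line: index this document, using the article URL as its ID.
        action = {"index": {"_index": index_name, "_id": article["url"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(article, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"
```

Using a stable ID means that re-indexing the same article overwrites the existing document instead of creating a duplicate.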
Part V: Data delivery
After being indexed, news articles are delivered to you via a REST JSON API.
Each call is transformed into an ElasticSearch query.
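The translation from API parameters to a search query might look like the sketch below. The parameter names (`q`, `lang`, `countries`, `sources`) and the document field names are assumptions for illustration. The output uses the standard `bool` query with a scored `must` clause for the keyword and unscored `filter` clauses for exact matches.

```python
def build_search_query(q, lang=None, countries=None, sources=None, size=10):
    """Translate API-style parameters into a query DSL dictionary."""
    # The keyword is matched against text fields and contributes to the score.
    must = [{"multi_match": {"query": q, "fields": ["title", "body"]}}]

    # Exact-match restrictions go into filter clauses, which do not affect scoring.
    filters = []
    if lang:
        filters.append({"term": {"language": lang}})
    if countries:
        filters.append({"terms": {"country": list(countries)}})
    if sources:
        filters.append({"terms": {"source": list(sources)}})

    return {"size": size, "query": {"bool": {"must": must, "filter": filters}}}
```

Calling `build_search_query("electric cars", lang="en", countries=["US", "GB"])` produces a query that scores articles on the keyword match while restricting results to English-language articles from the two listed countries.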