How Does Our Local News API Work?

The Local News API provides city-specific news for over 31,000 U.S. locations, solving issues like keyword mix-ups and article relevance. Learn how it works and why it’s built this way.

Local News API is a NewsCatcher product that delivers hyperlocal, city-focused news via an API. We currently cover over 31,000 locations in the U.S. We aggregate news articles, detect the locations they refer to using advanced NLP techniques, and resolve keyword clashes to give you clean data with minimal effort on our customers’ end. In this blog post, we’ll discuss how and why we built the Local News API. For a quick primer on how to use the API, read the companion post: Introducing NewsCatcher Local News API.

The Need and Vision for Local News API

Consider the task of building a precise hyperlocal news feed that aggregates all the news articles for a particular city. Using the city name as a search keyword in a news search engine or a generalized news API does not suffice. First, there are keyword clashes. A location name can clash with other keywords: for example, searching for “Page” on Google News Search returns news about pagers exploding in Lebanon, while our customer might have been looking for news about the city of Page, Arizona. Multiple locations can also share the same name; a classic example is Washington state vs. Washington, D.C. Second, there is the problem of hierarchy. A news article about an event in Brooklyn should ideally show up under New York news even if the words “New York” appear nowhere in the article. A naive keyword-based search cannot do this.
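To make both problems concrete, here is a minimal sketch. The tiny PARENT table and the helper names naive_match and with_ancestors are hypothetical and exist only for illustration; they are not part of the API or of our production code.

```python
import re

# Hypothetical mini-hierarchy: each place maps to its parent (None = top level).
PARENT = {
    "Brooklyn": "New York",
    "New York": None,
    "Page": None,
}


def naive_match(text: str, city: str) -> bool:
    """Naive keyword search: true if the city name appears as a word, ignoring case."""
    return re.search(rf"\b{re.escape(city)}\b", text, flags=re.IGNORECASE) is not None


def with_ancestors(place: str) -> list[str]:
    """Return the place followed by all of its parents in the hierarchy."""
    chain = []
    current = place
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


# Keyword clash: an unrelated article matches the city "Page".
clash = naive_match("The landing page was redesigned last week.", "Page")

# Hierarchy: a Brooklyn article never mentions "New York", so keyword search misses it...
brooklyn_article = "A new bakery opened in Brooklyn on Friday."
missed = naive_match(brooklyn_article, "New York")

# ...but expanding the detected place through the hierarchy recovers the parent feed.
feeds = with_ancestors("Brooklyn")
```

Here clash is true even though the text has nothing to do with Page, Arizona, and the Brooklyn article only reaches the New York feed once the hierarchy is consulted.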

Several of our customers needed this kind of news feed. For instance, one customer wanted to integrate a local news feed into their neighborhood-based community networking app. Another was looking to use local news to track any negative news about the bonds in their investment portfolio.

At NewsCatcher, we gather and process around two hundred thousand news articles each day from within the U.S. alone, and we decided to use this data to provide precise city-level news to our customers. With the vision of providing a no-frills, location-based news feed, we built the Local News API.

Tech Stack and Architecture

The Local News API is an extension of our core tech stack, which mostly runs on Python. As a widely used general-purpose language, Python was a convenient and robust choice for both handling the news data and serving the API. We serve our APIs with FastAPI, a high-performance framework that is developer-friendly and supports async/await.

Our stack consists of multiple services, and by using a microservice architecture, we keep each service independently deployable and maintainable. We also keep our architecture clean and adaptable by separating business logic from external dependencies. This helps us upgrade or switch external dependencies when necessary, without affecting the core business logic. The different services communicate with each other using RabbitMQ.
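As a rough illustration of separating business logic from an external dependency, the sketch below hides the message broker behind a small interface. The names MessageBus, InMemoryBus, and forward_article, as well as the queue names, are hypothetical; a real adapter would wrap a RabbitMQ client, which is deliberately left out here.

```python
from typing import Protocol


class MessageBus(Protocol):
    """Port: the only thing the business logic knows about messaging."""

    def publish(self, queue: str, message: dict) -> None: ...


class InMemoryBus:
    """Stand-in adapter for local runs and tests; a real one would wrap a broker client."""

    def __init__(self) -> None:
        self.queues: dict[str, list[dict]] = {}

    def publish(self, queue: str, message: dict) -> None:
        self.queues.setdefault(queue, []).append(message)


def forward_article(article: dict, bus: MessageBus) -> None:
    """Business logic: choose a queue for the article without touching any broker API."""
    queue = "localized" if article.get("associated_town") else "unlocalized"
    bus.publish(queue, article)


bus = InMemoryBus()
forward_article({"title": "A", "associated_town": [{"name": "New York"}]}, bus)
forward_article({"title": "B", "associated_town": []}, bus)
```

Because forward_article depends only on the MessageBus interface, swapping the broker means writing a new adapter and leaving the business logic untouched.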

Data Sourcing and Ingestion

For the Local News API, we use the same data we source for our generalized News API product, the V3 API. In essence, for our V3 API, we monitor around 90,000 news media outlets daily and scrape every news article that is legally permitted and technically feasible to scrape. We then clean the article text and extract metadata such as title, publication date, and author. After that, we enrich the data with sentiment analysis, entity recognition, and so on. Finally, we ingest all the news data into Elasticsearch to support different queries and filters via the API.

The Data Pipeline for Local News API

For the Local News API, all we had to do was extend the above data pipeline. Articles are queued from there and processed by an orchestrator, which enriches the incoming articles with locations and other entity data using techniques such as rule-based matching and AI-based extraction and validation. Once the articles are enriched, they go to the routing service, which dispatches the data to general and client-specific services, from which our customers consume it via the API.
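The flow from orchestrator to routing service can be pictured roughly as follows. This is a simplified sketch under stated assumptions: the function names, the placeholder enricher, and the service names in subscriptions are all hypothetical, and queues are omitted entirely.

```python
from typing import Callable

Article = dict
Enricher = Callable[[Article], Article]


def add_locations(article: Article) -> Article:
    """Placeholder enricher; a real one would run rule-based and AI-based extraction."""
    towns = [{"name": "New York"}] if "New York" in article["content"] else []
    return {**article, "associated_town": towns}


def orchestrate(article: Article, enrichers: list[Enricher]) -> Article:
    """Run every enrichment step in order and return the enriched article."""
    for enrich in enrichers:
        article = enrich(article)
    return article


def route(article: Article, subscriptions: dict[str, set[str]]) -> list[str]:
    """Return the downstream services whose subscribed towns overlap the article's towns."""
    towns = {t["name"] for t in article.get("associated_town", [])}
    return sorted(name for name, wanted in subscriptions.items() if towns & wanted)


# Hypothetical services: one general feed and one client-specific service.
subscriptions = {"general-feed": {"New York", "Page"}, "client-a": {"Page"}}
enriched = orchestrate({"content": "Subway delays hit New York today."}, [add_locations])
targets = route(enriched, subscriptions)
```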

A schematic diagram showing our data pipeline

Localizing News Data

To start with, we set out to identify mentions of any of the approximately 31,000 U.S. cities and towns in the news articles that we process. To achieve this, we use a two-step process. In the first step, rule-based matching assigns locations to an article based on the presence of location names in its text. If the rule-based step finds any locations, we verify them using AI in the second step. If it finds none, we use AI to identify the locations directly. With this two-step process, we localize news articles down to the town level with 84% accuracy.
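The control flow of this two-step process can be sketched as below. The AI steps are replaced by toy functions (toy_validate and toy_extract) so that the example runs on its own; those functions, the two-city list, and all names are assumptions made for illustration and say nothing about the models or rules we actually use.

```python
import re
from typing import Callable

# Hypothetical stand-in for the full list of roughly 31,000 cities and towns.
CITIES = ["Page", "Rochester"]


def rule_based(text: str) -> list[str]:
    """Step 1: pick every city whose name appears as a whole word."""
    return [c for c in CITIES if re.search(rf"\b{re.escape(c)}\b", text)]


def localize(
    text: str,
    validate: Callable[[str, str], bool],
    extract: Callable[[str], list[str]],
) -> list[str]:
    """Two-step localization with the AI parts injected as plain callables."""
    candidates = rule_based(text)
    if candidates:
        # Rule-based hits exist, so the AI step only verifies them.
        return [c for c in candidates if validate(text, c)]
    # No rule-based hit, so the AI step identifies locations directly.
    return extract(text)


def toy_validate(text: str, city: str) -> bool:
    """Toy check standing in for a model: accept a city only if a state follows it."""
    return re.search(rf"\b{re.escape(city)}, (Arizona|New York)\b", text) is not None


def toy_extract(text: str) -> list[str]:
    """Toy extractor standing in for a model: resolves one well-known nickname."""
    return ["Rochester"] if "Flower City" in text else []


flood = localize("Flooding closed roads in Page, Arizona.", toy_validate, toy_extract)
gossip = localize("Page Six reported on the gala.", toy_validate, toy_extract)
nickname = localize("Flower City voters head to the polls.", toy_validate, toy_extract)
```

The first article keeps its rule-based match after validation, the second has its false match rejected, and the third is localized by the extractor alone because no city name appears in the text.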

Once localization is done, the news data is ready for querying by location. Let’s see an example API call:

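import json
import os

import requests

# Assumption: the base URL and API token are provided as environment variables.
NC_ENDPOINT = os.environ['NC_ENDPOINT']
NC_API_TOKEN = os.environ['NC_API_TOKEN']

# Query the latest headlines associated with New York.
r = requests.post(
    f'{NC_ENDPOINT}/api/latest_headlines',
    headers={'x-api-token': NC_API_TOKEN},
    json={
        'associated_towns': [{'name': 'New York'}],
        'page_size': 10,
        'when': '1d',
    },
)

# Pretty-print the JSON response.
print(json.dumps(r.json(), indent=2))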

The above API request returns the following response:

{
  "status": "ok",
  "total_hits": 3,
  "page": 1,
  "total_pages": 1,
  "page_size": 10,
  "articles": [
    {
      "id": "5e05185a3499db5817f265fc354f1d52",
      "associated_town": [
        {
          "ai_validated": true,
          "name": "Rochester, New York",
          "description": [
            "HYPERLOCAL_SOURCES_EXCLUDE_QUERY",
            "HYPERLOCAL_SOURCES_INCLUDE_QUERY"
          ]
        },
        {
          "ai_validated": true,
          "name": "New York",
          "description": [
            "LOCAL_SOURCES_EXCLUDE_QUERY"
          ]
        }
      ],
      "ai_associated_town": null,
      "score": null,
      "title": "Vote: Section V's Girls Sports Athlete of the Week for Oct. 20-26 presented by Faber Builders",
      "author": "Marquel Slaughter",
      "link": "https://www.democratandchronicle.com/story/sports/high-school/2024/10/28/who-is-section-v-girls-sports-athlete-of-the-week-for-oct-20-26-vote-now/75836114007",
      "description": "Your vote will determine who will be the Faber Builders Girls Sports Athlete of the Week for October 20-26.",
      "media": "https://www.democratandchronicle.com/gcdn/authoring/authoring-images/2024/09/06/PROC/75108463007-aotw-article-page-hdr-1200-x-628.jpg?crop=1115,627,x58,y0&width=1115&height=627&format=pjpg&auto=webp",
      "content": "It's time to take a...(full content truncated)",
      "authors": [
        "Justin Ritzel",
        "James Johnson",
        "Marquel Slaughter"
      ],
      "published_date_precision": "full",
      "published_date": "2024-10-28 11:03:48",
      "updated_date": "2024-10-28 11:03:48",
      "updated_date_precision": "full",
      "is_opinion": false,
      "twitter_account": "@DandC",
      "domain_url": "democratandchronicle.com",
      "parent_url": "https://www.democratandchronicle.com/sports",
      "word_count": 357,
      "rank": 5339,
      "country": "US",
      "rights": "democratandchronicle.com",
      "language": "en",
      "nlp": {
        "theme": [
          "Sports"
        ],
        "summary": "Your vote will determine who will...(ai summary truncated)",
        "sentiment": {
          "title": 0.0,
          "content": 0.0
        },
        "ner_PER": [
          {
            "entity_name": "Governor",
            "count": 1
          },
          ...list truncated for illustration...
        ],
        "ner_ORG": [
          {
            "entity_name": "Section V",
            "count": 2
          },
          ...list truncated for illustration...
        ],
        "ner_MISC": [
          {
            "entity_name": "Girls Sports Athlete of the Week",
            "count": 1
          },
          ...list truncated for illustration...
        ],
        "ner_LOC": [
          {
            "entity_name": "Silver Hill Tech Park",
            "count": 1
          },
          ...list truncated for illustration...
        ]
      },
      "paid_content": false
    },
    ...list truncated for illustration...
  ],
  "user_input": "...object showing the input..."
}

In the above result, we can see data added by each of the pipeline components described earlier: original article metadata such as title, author, and publication date; enrichments such as recognized entities and sentiment scores; and the detected locations.
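To pull the detected locations out of such a response, something along these lines should work. It is a small sketch that relies only on the field names visible in the sample response; the helper name town_names and the trimmed-down sample dictionary are made up for illustration.

```python
def town_names(response: dict) -> dict[str, list[str]]:
    """Map each article title to the names listed under its 'associated_town' field."""
    result = {}
    for article in response.get("articles", []):
        # The field may be missing or null, so fall back to an empty list.
        towns = article.get("associated_town") or []
        result[article["title"]] = [town["name"] for town in towns]
    return result


# Trimmed-down sample with the same shape as the response shown above.
sample = {
    "status": "ok",
    "articles": [
        {
            "title": "Vote: Section V's Girls Sports Athlete of the Week",
            "associated_town": [
                {"ai_validated": True, "name": "Rochester, New York"},
                {"ai_validated": True, "name": "New York"},
            ],
        }
    ],
}

names = town_names(sample)
```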

Scaling and Optimizing the Performance of the API

News data is very dynamic; fresh data comes in every minute. Handling huge volumes of fresh data while keeping our servers performant and reliable, and still processing and serving the latest news, was challenging. We started with a plain Elasticsearch setup and quickly realized that querying thousands of locations per minute was resource-intensive and unsustainable.

We had to change our approach. Instead of querying Elasticsearch on every request, we implemented a consumer-based model, in which incoming articles are passed directly to a service that separately handles the queries for the corresponding location. We also implemented a reverse search system: we keep a list of incoming queries and map fresh incoming articles to them. When a customer uses one of these queries in an API call, we simply return the already-mapped articles instead of running a fresh text search. This approach significantly reduces response time and keeps our API responsive even under high load, which is why we can call our Local News API near real-time.
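The reverse search idea can be illustrated with a small in-memory sketch. It assumes that each registered query targets a single town and that articles arrive with their towns already detected; the class name ReverseSearch, its methods, and the identifiers are hypothetical, and a production system would need persistence, expiry, and richer query matching.

```python
from collections import defaultdict


class ReverseSearch:
    """Match each incoming article against registered queries once, at ingest time."""

    def __init__(self) -> None:
        self.queries_by_town: dict[str, set[str]] = defaultdict(set)
        self.results: dict[str, list[str]] = defaultdict(list)

    def register(self, query_id: str, town: str) -> None:
        """Remember that this query is interested in the given town."""
        self.queries_by_town[town].add(query_id)

    def ingest(self, article_id: str, towns: list[str]) -> None:
        """Map a fresh article to every registered query that one of its towns touches."""
        matched = set()
        for town in towns:
            matched |= self.queries_by_town.get(town, set())
        for query_id in matched:
            self.results[query_id].append(article_id)

    def fetch(self, query_id: str) -> list[str]:
        """Serve a request with a plain lookup; no text search runs here."""
        return list(self.results.get(query_id, []))


index = ReverseSearch()
index.register("q-new-york", "New York")
index.register("q-page", "Page")
index.ingest("a1", ["Rochester, New York", "New York"])
index.ingest("a2", ["Page"])
index.ingest("a3", ["Austin"])
```

The matching cost is paid once per article at ingest time, so answering an API call becomes a dictionary lookup whose cost does not depend on how many articles have been indexed.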

Testing, QA, and SLAs

All components of our API are rigorously tested using Python’s unittest module. We test each component in isolation, test the interactions between components, and finally test the whole system end-to-end, simulating real-world scenarios. Doing this regularly has helped us identify and fix issues in a timely manner, keeping our API stable and robust. It also lets us offer our customers SLAs (Service Level Agreements) of up to 99.95% API uptime and up to 99% per-source coverage.
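For readers unfamiliar with unittest, testing a component in isolation looks roughly like this. The component under test, normalize_town, is a made-up helper used only to keep the example self-contained; it is not taken from our codebase.

```python
import unittest


def normalize_town(name: str) -> str:
    """Hypothetical component under test: collapse extra whitespace in a town name."""
    return " ".join(name.split())


class NormalizeTownTest(unittest.TestCase):
    def test_collapses_whitespace(self):
        self.assertEqual(normalize_town("  New   York "), "New York")

    def test_keeps_clean_names(self):
        self.assertEqual(normalize_town("Page"), "Page")


# Load and run the suite programmatically so the sketch also works outside the CLI.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTownTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Integration and end-to-end tests follow the same pattern, with real collaborators or a running system in place of the isolated function.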

We use Git for version control, which lets us track changes, collaborate effectively, and maintain a detailed history of updates. We also version our API to support backward compatibility, so there are no surprises for our customers when we upgrade it. They can continue using the older version and upgrade whenever they like.

Use Cases for the API

The most obvious use case for our Local News API is adding a local news feed to consumer apps. It’s an engaging feature for any app that operates in the hyperlocal space. In the business space, local news can be quite helpful to the real estate sector and to chain businesses (such as fast food chains) looking to set up shop in new locations. For investors in local assets, it can help in discovering opportunities and analyzing risks. We are sure there are several use cases we have missed, so if you have an idea of how you want to use our Local News API, reach out to us for access.

Roadmap for Improvements

As excited as we are to roll out the Local News API for English articles in U.S. cities, we’re already working on improving and expanding the product. First on the cards is address detection, for more granular locality associations. In addition, we’re working to support languages other than English, starting with Spanish. We’re also looking to expand the API to other countries based on demand.
