Clustering news articles

Introduction

Imagine walking into a massive library where all the books are scattered randomly across the floor. Finding related information would be a nightmare, right? That’s often what it’s like when dealing with large volumes of news data. Enter clustering - a powerful feature in News API v3 that acts like a team of lightning-fast librarians, instantly organizing articles into meaningful groups.

Clustering is available for all languages supported by News API v3. This allows you to group similar articles across the entire multilingual database.

What is clustering?

Clustering is an advanced process that goes beyond simple keyword matching. It uses sophisticated language processing to understand the content and context of each article, grouping related pieces together even if they use different words to describe the same concepts.

Here’s what clustering does for you:

Reveals connections between articles, helping you spot trends and patterns in large volumes of news data.
Simplifies analysis of how different sources cover the same story.
Saves time by automatically organizing information into coherent groups.
Provides a clearer picture of the news landscape, making it easier to track evolving stories and identify emerging topics.

By leveraging clustering in News API v3, you transform a chaotic flood of information into a structured, insightful resource, enabling more efficient and effective news analysis.

How does it work?

Our clustering system uses a streamlined process to group similar articles based on their semantic similarity. The clustering process occurs dynamically at the API level, taking into account the search filters you apply. This means you get clusters that are tailored to your specific query, not just generic groupings.

Embeddings generation

The foundation of our clustering process is the creation of article embeddings. These embeddings capture the semantic meaning of the content - not just the words but the ideas behind them.

Think of these embeddings as creating a unique fingerprint for each article based on its content.

Similarity calculation

When you make a request that includes clustering, we use these pre-generated embeddings to group similar articles:

We compare the embeddings of different articles using cosine similarity.
This gives us a score that tells us how similar articles are in terms of their content and meaning.

Cluster formation

Based on the similarity scores, we form clusters:

Articles with a similarity score above our clustering threshold get grouped into clusters.
Each cluster gets a unique identifier, so you can easily refer to it later.

How to use clustering

Clustering is only available for the Search and Latest Headlines endpoints.

Enable clustering

To activate clustering and fine-tune its behavior, use the following parameters in your API request:

clustering_enabled (boolean): Set to true to enable clustering.
clustering_threshold (float): Determines how similar articles need to be to end up in the same cluster. Values range from 0 to 1, with higher values resulting in clusters with more similar articles. The default value is 0.6.
clustering_variable (string): Chooses which part of the article to use for clustering. Options are content (default), title, or summary.

Optimize clustering with page size

An important consideration when using clustering is the page_size parameter. Clustering operates on one page of results at a time, affecting how articles are grouped. To ensure the most effective clustering:

Set page_size to a value greater than your expected total_hits.
This allows all relevant articles to be considered for clustering together.

For example, if your query is likely to return 150 articles, set page_size to at least 150. This prevents related articles from being split across different pages and, thus, different clusters.

Understand API response

When you enable clustering, your API response will include some new elements:

clusters_count: The total number of clusters found
clusters: An array of cluster objects, each containing:
- cluster_id: A unique identifier for the cluster
- cluster_size: The number of articles in the cluster
- articles: An array of the articles in the cluster

Code example

Here’s how you might use clustering in a Python script:

clustering.py

import requests
import os
import json

# Configuration
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your actual API key
URL = "https://v3-api.newscatcherapi.com/api/search"
HEADERS = {"x-api-token": API_KEY, "Content-Type": "application/json"}
PAYLOAD = {
    "q": "renewable energy",
    "lang": "en",
    "clustering_enabled": True,
    "clustering_variable": "content",
    "clustering_threshold": 0.9,
}

try:
    # Fetch articles using POST method
    response = requests.post(URL, headers=HEADERS, json=PAYLOAD)
    response.raise_for_status()  # Check if the request was successful

    # Print the raw JSON response
    print(json.dumps(response.json(), indent=2))

except requests.exceptions.RequestException as e:
    print(f"Failed to fetch articles: {e}")

In the response, you’ll see how articles are grouped into clusters. Here’s a snippet of what you might get back:

response.json

{
  "status": "ok",
  "total_hits": 10000,
  "page": 1,
  "total_pages": 100,
  "page_size": 100,
  "clusters_count": 65,
  "clusters": [
    {
      "cluster_id": "7222464423361803386",
      "cluster_size": 11,
      "articles": [
        {
          "title": "Why Is NextEra Energy (NEE) The Best Alternative Fuel Stock To Buy Right Now?",
          "author": "Mashaid Ahmed",
          "published_date": "2024-08-25 17:36:01"
          // ... other article details ...
        }
        // ... other articles in the cluster ...
      ]
    }
    // ... other clusters ...
  ]
}

This shows that the API found 65 clusters, with one cluster containing 11 articles about NextEra Energy and alternative fuel stocks.

Use cases

Clustering can be a game-changer in various scenarios. Here are some common use cases:

Trend identification: Quickly spot emerging trends by analyzing large clusters of articles on similar topics, giving you a bird’s-eye view of the news landscape.
Diverse perspectives analysis: Examine how different sources cover the same story within a cluster, providing a comprehensive view of news events from various angles.
Content organization: Efficiently organize large volumes of news content into meaningful groups, as if you had a personal librarian instantly categorizing your articles.
Story evolution tracking: Follow how news stories develop over time by analyzing changes in cluster composition and size, watching stories grow, merge, or fade away in real-time.
Enhanced search capabilities: Improve search results by grouping related articles together, allowing users to quickly find relevant information with context-aware precision.

Clustering vs deduplication

While both clustering and deduplication help organize large sets of articles, they serve different purposes:

Feature	Clustering	Deduplication
Purpose	Groups similar articles	Removes nearly identical articles
Content	Retains all articles	Removes duplicates
Similarity Threshold	Generally lower, allowing broader groups	Higher, identifying near-exact matches
Output	Groups of related articles	Set of unique articles
Use Case	Analyzing related content, tracking trends	Eliminating redundancy, ensuring uniqueness

Choose clustering when you want to analyze related content and track trends. Go for deduplication to eliminate redundancy and ensure uniqueness in your article set.

For more information on our deduplication feature, check out Articles deduplication.

Wrapping up

Clustering in News API v3 is like having a smart assistant that can quickly organize mountains of news data into meaningful groups. Whether you’re tracking trends, analyzing diverse perspectives, or just trying to make sense of the news firehose, clustering can help you see the forest for the trees.

We encourage you to try clustering in your News API v3 queries and see how it can enhance your news analysis. As always, we’re here to help if you have any questions or need assistance using this feature. Happy clustering!

Get started

Guides and concepts

Migration

How to

Troubleshooting

Clustering news articles

Introduction

What is clustering?

How does it work?

Embeddings generation

Similarity calculation

Cluster formation

How to use clustering

Enable clustering

Optimize clustering with page size

Understand API response

Code example

Use cases

Clustering vs deduplication

Wrapping up

Get started

Guides and concepts

Migration

How to

Troubleshooting

​Introduction

​What is clustering?

​How does it work?

​Embeddings generation

​Similarity calculation

​Cluster formation

​How to use clustering

​Enable clustering

​Optimize clustering with page size

​Understand API response

​Code example

​Use cases

​Clustering vs deduplication

​Wrapping up

Introduction

What is clustering?

How does it work?

Embeddings generation

Similarity calculation

Cluster formation

How to use clustering

Enable clustering

Optimize clustering with page size

Understand API response

Code example

Use cases

Clustering vs deduplication

Wrapping up