
Comparing News Data Search: LLMs, Analyst, and NewsCatcher Pipelines

We ran a study comparing NewsCatcher news pipelines with alternative methods of gathering news data. Check the results.


Introduction

If you have stumbled upon the NewsCatcher API or this blog, you're probably interested in gathering news data at scale. This could be for tracking events for market analysis, monitoring your brand or company in the media, or following local news to prepare for supply chain disruptions. You might wonder why you'd need a news API in 2025, when you could hire analysts to search manually or simply use AI.

At NewsCatcher, we asked ourselves the same question. To find an answer, we ran a comparative study on some common news data-gathering tasks. For each of three tasks, we compared the following methods:

  • GPT-4o mini, with search
  • Claude Sonnet with computer use
  • Perplexity Pro
  • A Human Analyst with Google Search
  • NewsCatcher Pipelines

For each task, we looked at two key performance metrics: time taken and the number of relevant articles obtained. Below, we'll go over what each task involves and how the different approaches fared.

TL;DR: In 2 of the 3 tasks, NewsCatcher fetched the largest number of articles relevant to the task. In the remaining task, NewsCatcher trailed only slightly behind the human analyst in coverage. The LLMs made it past the 10% coverage mark in just one task, with a best coverage score of around 37%.

Task 1: Event Tracking

One of the most common use cases of news monitoring is obtaining news articles that mention a certain kind of event. For this task, we set out to track all news about corporate HQ changes, including openings, closings, and relocations, across the USA between 1 and 8 January 2025.

For the AI search tools, we formulated a comprehensive prompt:

Prompt:

You are tasked with searching the internet to identify news articles related to corporate headquarters changes announced within a specific timeframe. Your goal is to extract relevant information and compile it into a CSV file. Follow these instructions carefully:

Search Criteria:

Use keywords indicating announcements of HQ changes (e.g., 'announced new HQ', 'relocating headquarters', 'planning HQ shift')
Focus on articles primarily discussing corporate HQ changes in the U.S.
Exclude articles related to government or non-private HQs and multiple companies in the same mention
Search Constraints:

Language: English
Region: United States
Timeframe(DD-MM-YY): 01-January-25 to 08-January-25
Information to Extract:

Company Name
Location Details:
City
County
State (full name)
Country (full name)
Raw Location (if available)
Announced Date (YYYY/MM/DD or higher precision)
Type of HQ Change (e.g., 'new', 'closed', 'relocated')
Area Size in Square Feet (if specified)
Raw Area Size (if sqft is unavailable)
Number of Employees Affected (if mentioned)
Concise summary (up to 500 characters)

Validation Criteria:
Ensure relevance to announcements of corporate HQ changes (not personal or general reports)
Assess level of relatedness to potential real estate or business implications
Output Format: Your final output should be a structured CSV file that includes the extracted fields. Format it similarly to this example:

Company Name,City,County,State,Country,Raw Location,Announced Date,Type of HQ Change,Area Size in Square Feet,Raw Area Size,Number of Employees Affected,Summary

Web Browsing Instructions: a. Use the web browsing functionality to search for relevant news articles within the specified timeframe. b. Open and read each relevant article to extract the required information. c. If an article doesn't meet the criteria or lacks sufficient information, move on to the next one.

Compiling the CSV: a. Start with a header row containing all the column names listed in the Output Format section. b. For each unique HQ change event, create a new row in the CSV file. c. Populate each cell with the corresponding information extracted from the article(s). d. Ensure all cells contain valid data or "N/A" if the information is not available. e. Double-check that the CSV file is properly formatted and readable.

Final Notes:
Exclude articles with publication dates outside the specified timeframe, even if they mention HQ changes occurring within the timeframe.
Ensure the CSV is well-structured, accurate, and free of duplicate entries.
If you encounter any challenges during the extraction or data organization process, note them at the end of your response.
When you have completed the task, provide a summary of your findings, including the number of relevant articles found and any notable trends or challenges encountered.
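As a sanity check on what the prompt's "Compiling the CSV" steps ask for (a header row, one row per unique event, "N/A" fills, deduplication), here is a minimal sketch in Python. The column names follow the prompt's Output Format; the helper function and the sample company are our own illustration, not part of the study.

```python
import csv
import io

# Column names from the prompt's Output Format section.
COLUMNS = [
    "Company Name", "City", "County", "State", "Country", "Raw Location",
    "Announced Date", "Type of HQ Change", "Area Size in Square Feet",
    "Raw Area Size", "Number of Employees Affected", "Summary",
]

def compile_csv(events):
    """Build CSV text from event dicts: fill gaps with N/A, drop duplicates."""
    seen = set()
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    for event in events:
        row = {col: event.get(col) or "N/A" for col in COLUMNS}
        key = (row["Company Name"], row["Type of HQ Change"])  # dedup key
        if key in seen:
            continue
        seen.add(key)
        writer.writerow(row)
    return buf.getvalue()

# Two identical announcements collapse to one row.
events = [
    {"Company Name": "Acme Corp", "City": "Austin", "Type of HQ Change": "relocated"},
    {"Company Name": "Acme Corp", "City": "Austin", "Type of HQ Change": "relocated"},
]
print(compile_csv(events).count("Acme Corp"))  # duplicate removed -> 1
```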

Here's how the different methods fared:

| Method | Initial time spent | Time spent per run | No. of events | Coverage (total = 75) | Real-timeness |
| --- | --- | --- | --- | --- | --- |
| SearchGPT (4o) | 30-45 mins of prompt engineering with feedback from the LLM | 5-10 mins if the LLM cooperates; up to another 30 mins if it doesn't comply | 0 or 1 | 1% | Updated when you run the search |
| Perplexity Pro | 30-45 mins of prompt engineering with feedback from the LLM | 5-10 mins if the LLM cooperates; up to another 30 mins if it doesn't comply | 5 | 6% | Updated when you run the search |
| Claude Sonnet with computer use | 30 mins on the initial set-up; 45 mins on prompt engineering and testing | 1.5 hours for the agent to do its thing, with manual intervention here and there | 5 (after multiple nudges for more results) | 6% | Updated when you run the search |
| Human analyst with browser | 5-10 minutes for defining keywords | 3-4 hours of manually browsing and putting things in a spreadsheet | 13 (after this, Google only showed non-US results, outside the scope of the task) | 17% | Updated when you run the search |
| NewsCatcher events pipeline | 3-6 hours of initial onboarding, where we define the event based on input from the client | < 1 minute: just download the file from the email, S3 bucket, or drive | 73 | 97.3% | Updated in the background, all the time |

From the table above, we can see that the LLMs were barely suitable for this task, returning 0-5 of the 75 results, and only when they complied, after multiple nudges for more results. The total time taken was around 1-2 hours, comparable to the human analyst with Google Search, who returned 13 results in around 3-4 hours.

With the NewsCatcher pipeline, we spent 3-6 hours on the initial event definition, which then yielded 73 results in less than a minute. Note that the initial definition is only needed for the first run of the pipeline; subsequent runs skip this step and return results within a minute. NewsCatcher pipelines also keep running on our servers, unlike the other methods, which run only on demand. These results make NewsCatcher an efficient choice for news-based events intelligence applications, such as monitoring the market and adapting your business strategy accordingly. You can see some hands-on examples demonstrating our Events Intelligence API on our blog: Detecting Events in News Using NewsCatcher's Events Intelligence API.
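For context, fetching such articles programmatically boils down to a single parameterized search request. The endpoint and parameter names below are illustrative placeholders, not NewsCatcher's documented interface; consult the actual API reference before using them.

```python
from urllib.parse import urlencode

def build_search_url(base, query, country, date_from, date_to, lang="en"):
    """Assemble a keyword-search request URL for an event-tracking query.

    Parameter names (q, lang, countries, from, to) are illustrative only.
    """
    params = {
        "q": query,
        "lang": lang,
        "countries": country,
        "from": date_from,
        "to": date_to,
    }
    return f"{base}?{urlencode(params)}"

# Hypothetical request covering Task 1's keywords and date range.
url = build_search_url(
    "https://api.example.com/v2/search",
    '"new headquarters" OR "relocating headquarters"',
    "US", "2025-01-01", "2025-01-08",
)
print("countries=US" in url)  # True
```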

Task 2: Local News Monitoring

Local news is another popular use case for news data. Multiple customers use our local news data to power engaging consumer apps or to track local assets in their investment portfolios. Naive keyword-based searches become tricky for this use case, as city or town names can be common words, and multiple towns can share the same name. With these challenges in mind, we tasked each method with getting local news from the small city of Arab, Alabama (US) for the week of 8-15 January 2025.

For LLMs, we used the following prompt:

Prompt:

You are tasked with browsing the internet to gather news articles about a specific city and compile the information into a CSV file. Follow these instructions carefully:

  1. You will be searching for news articles about the city of {{CITY_NAME}} in the state of {{STATE_NAME}}, {{COUNTRY_NAME}}, published within the last {{TIME_FRAME}}.
  2. Use a web browsing tool to search for news articles. Utilize reputable news sources and search engines. Make multiple searches if necessary to ensure comprehensive coverage.
  3. For each article found, verify that it meets the following criteria: a. The article explicitly mentions {{CITY_NAME}}, {{STATE_NAME}}. b. The article is about the correct {{CITY_NAME}}. Verify that it's not about a similarly named city in another state or country. c. The article was published within the specified time frame of {{TIME_FRAME}}.
  4. For each relevant article, collect the following information: a. Title: The full title of the article. b. Link: The complete URL to the article. c. Publish Date and Time: The exact date and time when the article was published, in a consistent format (e.g., YYYY-MM-DD HH:MM:SS).
  5. Organize the collected data into a CSV file with the following columns: a. Title b. Link c. Publish Date and Time
  6. If no articles are found that meet the criteria, create a CSV file with the same columns but include a single row with the following information: a. Title: "No relevant articles found" b. Link: "N/A" c. Publish Date and Time: Current date and time
  7. If you encounter any errors during the process (e.g., inability to access certain websites), note these in a separate "Errors" column in the CSV file.
  8. Once you have completed the task, provide a summary of your findings, including: a. The total number of articles found b. Any challenges or notable observations during the search process c. A confirmation that the CSV file has been generated

Remember to adhere strictly to these instructions and only use the web browsing tools provided to you. Do not fabricate any information or use external knowledge not obtained through the web search.
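The verification steps 3a-3c in the prompt amount to three boolean checks, which can be sketched as follows. The sample article and its field names are made up for illustration.

```python
from datetime import datetime

def is_relevant(article, city, state, start, end):
    """Check an article against steps 3a-3c: city mention, state context, date window."""
    text = f"{article['title']} {article['body']}".lower()
    published = datetime.fromisoformat(article["published"])
    mentions_city = city.lower() in text
    # Requiring the state guards against same-named cities elsewhere.
    mentions_state = state.lower() in text
    in_window = start <= published <= end
    return mentions_city and mentions_state and in_window

article = {
    "title": "City council meets in Arab",
    "body": "Officials in Arab, Alabama discussed the new budget.",
    "published": "2025-01-10",
}
start, end = datetime(2025, 1, 8), datetime(2025, 1, 15)
print(is_relevant(article, "Arab", "Alabama", start, end))  # True
```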

Let’s see the results table:

| Method | Initial time spent | Time spent per run | No. of articles | Coverage (total = 10) | Real-timeness |
| --- | --- | --- | --- | --- | --- |
| SearchGPT (4o) | 30-45 mins of prompt engineering with feedback from the LLM | 5-10 mins if the LLM cooperates; up to another 30 mins if it doesn't comply | 0 | 0% | Updated when you run the search |
| Perplexity Pro | 30-45 mins of prompt engineering with feedback from the LLM | 5-10 mins if the LLM cooperates; up to another 30 mins if it doesn't comply | 0 | 0% | Updated when you run the search |
| Claude Sonnet with computer use | 30 mins on the initial set-up; 45 mins on prompt engineering and testing | 1 hour for the agent to do its thing, with manual intervention here and there | 4 | 40% | Updated when you run the search |
| Human analyst with browser | 5 minutes for defining keywords and checking if Google has a location topic for the city | 2 hours of manually browsing and putting things in a spreadsheet | 2 | 20% | Updated when you run the search |
| NewsCatcher local news pipeline | 5 mins | < 1 minute: just fetch it from our API, or we can send clients timely dumps | 9 | 90% | Updated in the background, all the time |

Starting with the LLMs, SearchGPT and Perplexity returned zero relevant results, sometimes surfacing articles from the Middle East instead because the city is named Arab, for example:

Arab soldier killed in Korean War brought home, laid to rest decades later

Surprisingly, Claude Sonnet with computer use returned 4 results and performed better than the human analyst, who found only 2. Most of the analyst's Google results were about the nearby, bigger town of Cullman, Alabama; removing Alabama from the search term predictably surfaced Middle East results instead.

NewsCatcher's pipeline was way ahead of the competition, returning 9 of the 10 relevant results we were expecting. While the other methods took 1-2 hours, NewsCatcher produced the results in about 6 minutes: we constantly monitor thousands of news sources and identify mentions of cities and towns in articles as soon as we detect that they're published. This makes NewsCatcher highly suitable for local news monitoring. You can read more about this offering on our blog: How Does Our Local News API Work?

Task 3: Entity Resolution

Entity resolution, or entity disambiguation, is a natural language processing technique used to distinguish different entities that share the same name or keyword. In news data analysis, it is important for filtering out irrelevant articles for a given keyword (e.g., Riot Games the company vs. actual riots). This is particularly useful for monitoring news about company entities.
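As a toy illustration of the idea (not NewsCatcher's actual pipeline), a text mentioning the ambiguous name can be kept only when its context also matches the target entity, here via the company domain or a few hand-picked context terms. Real entity resolution uses far richer signals; the terms below are our own examples.

```python
# Target entity and context terms that distinguish the company from other
# uses of the same word. All terms here are illustrative.
COMPANY = "blaize"
POSITIVE_CONTEXT = {"blaize.com", "chip", "ai", "semiconductor", "nasdaq"}

def mentions_company(text):
    """True if the text names the entity AND shares context with the company."""
    words = set(text.lower().split())
    return COMPANY in words and bool(words & POSITIVE_CONTEXT)

print(mentions_company("Blaize lists on Nasdaq after AI chip deal"))   # True
print(mentions_company("The chef named his bistro Blaize last spring"))  # False
```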

For this study, the task was to fetch news articles mentioning the company Blaize (blaize.com) published in the month between 15 December 2024 and 15 January 2025. For the LLMs, we used the following prompt:

Prompt:

You are an AI assistant tasked with creating a CSV file containing information about articles that mention a specific company. This task is crucial for media monitoring and brand tracking purposes. Please follow these instructions carefully to complete the assignment.

First, here are the key details about the company you'll be researching:

Company Name:

{{company_name}}

Company Domain URL:

{{domain_URL}}

Now, let's break down the task into steps:

  1. Search Parameters:
    • Search for articles mentioning the company name or domain URL.
    • Date range: December 15, 2024, to January 15, 2025.
  2. Article Search:
    • Use multiple search queries to ensure comprehensive results.
    • Utilize the computer use agent for all search-related actions.
  3. CSV File Creation: Create a CSV file with the following columns: a. Title b. URL c. Publish Date and Time d. Summary e. Relevance Justification
  4. For each relevant article: a. Extract the title, URL, and publish date/time. b. Read the article and create a brief summary (2-3 sentences) focusing on company-related information. c. Analyze and justify the article's relevance to the specified company.
  5. CSV File Management:
    • Add each article's information to the CSV file.
    • Use the computer use agent to save the file as "company_name_articles_Dec2024_Jan2025.csv".
  6. Quality Check:
    • Ensure all articles in the CSV are within the specified date range.
    • Verify that all entries are relevant to the company.
  7. Summary Report: After completing the CSV file, provide a brief summary of your findings.

Before you begin, outline your approach, considering:

  • List potential search queries for finding relevant articles.
  • Outline a strategy for efficiently reading and summarizing articles.
  • Consider potential challenges in determining article relevance and how to address them.
  • Plan how to tackle each step of the process systematically.

Example CSV structure:

Title,URL,Publish Date and Time,Summary,Relevance Justification
"Example Article Title","https://example.com/article","2024-12-20 14:30:00","This is a brief summary of the article's main points related to the company.","This article is relevant because it mentions the company's new product launch and quotes the CEO."

Once you have completed the task, provide a summary of your findings, including the number of articles found and any notable trends or themes you observed.

We were expecting 11 results to be found; let's see how many each of our methods could get:

| Method | Initial time spent | Time spent per run | No. of articles | Coverage (total = 11) | Real-timeness |
| --- | --- | --- | --- | --- | --- |
| GPT-4o mini with Search | 30-45 mins of prompt engineering with feedback from the LLM | 5-10 mins if the LLM cooperates; up to another 30 mins if it doesn't comply | 3 | 27.3% | Updated when you run the search |
| Perplexity Pro | 30-45 mins of prompt engineering with feedback from the LLM | 5-10 mins if the LLM cooperates; up to another 30 mins if it doesn't comply | 2 | 18.2% | Updated when you run the search |
| Claude Sonnet with computer use | 30 mins on the initial set-up; 45 mins on prompt engineering and testing | 15 minutes for the agent to do its thing, with manual intervention here and there | 4 | 36.3% | Updated when you run the search |
| Human analyst with browser | 5 minutes for defining keywords | 2 hours of manually browsing and putting things in a spreadsheet | 11 | 100% | Updated when you run the search |
| NewsCatcher entity resolution pipeline | 5 mins | < 1 minute: just fetch it from our API, or we can send clients timely dumps | 9 | 81.8% | Updated in the background, all the time |

From the results above, we can see that the LLMs were not very effective here either, returning just 2-4 of the 11 results. Unlike in the previous tasks, the human analyst searching on Google actually got all 11 expected results, while NewsCatcher's entity resolution pipeline wasn't far behind with 9. While the coverage is comparable, NewsCatcher takes only around 5 minutes versus the analyst's roughly 2 hours. This makes NewsCatcher a worthy option for entity resolution and company news monitoring.

Company news can give you valuable insight into how well your brand is doing in media mentions and flag any negative coverage for you to act upon. Beyond your own company, you might find this useful for monitoring your competitors and their activities in the market. And if you're in the finance and investment sector, you might also want to monitor the companies in your investment portfolio.

Conclusion

In this blog, we ran a study comparing NewsCatcher news pipelines with alternative methods of gathering news data: LLM-based search tools and a human analyst running Google searches in a browser. The results were clear: NewsCatcher provides 80-100% coverage of the articles relevant to each task, and, most importantly, it does so in a time-efficient manner. This is possible because dedicated news ingestion and analysis pipelines run constantly on our backend, so you can simply call an API to get highly relevant news data as per your needs. Sign up to try the NewsCatcher API today.
