One of the significant uses of news data is to track events of particular interest. For example, venture capitalists and financial investors can use the news to monitor market events or receive updates about their portfolio items. Administrators can track events in the news to study and act on issues of public interest, such as crimes and infrastructure lapses.
Transparency International UN is a global coalition fighting against corruption. They work with various stakeholders, such as policymakers and the public, to strengthen anti-corruption laws and enable reporting of corruption. A significant product of their work is the Corruption Perceptions Index (CPI), which ranks 180 countries based on their perceived levels of public sector corruption. Media monitoring is a key component of the research that goes into their work, and they use the NewsCatcher Events API to detect events pertaining to ‘acts of corruption’ from the hundreds of thousands of news articles published every day.
Extracting event information from news data involves using LLMs and Natural Language Processing techniques to get structured data from a free-text news article. In the context of corruption, this involves specific information fields such as accused parties, the timeline of the event, and the amounts involved. With NewsCatcher, all this extraction work is done on our end, and the results are readily available via an API.
In this tutorial, let’s look at how we can detect events from news articles using NewsCatcher’s Events Intelligence API. We’ll be detecting ‘acts of corruption’ events as an example. All the code snippets are presented in Python, but since the API is a REST API, the same requests can be made from any programming language.
Basic Setup
For this tutorial, we’ll need the following:
- NewsCatcher API Endpoint: Base URL for all API requests
- NewsCatcher API Key: For authentication, using the x-api-token HTTP header
- A Python (3.x) installation with the requests library installed
Let’s put these at the start of the code:
import requests
import json
NC_ENDPOINT = '<newscatcher-api-endpoint>'
NC_API_KEY = '<newscatcher-api-key>'
We also imported the built-in json module to pretty-print and view the JSON outputs.
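If we prefer not to hardcode the endpoint and key in the script, one option is to read them from environment variables. The sketch below is only an illustration: the variable names NEWSCATCHER_ENDPOINT and NEWSCATCHER_API_KEY and both helper functions are our own choices, not part of the NewsCatcher API. Only the x-api-token header name comes from the setup above.

```python
import os


def load_config(env=os.environ):
    """Return (endpoint, api_key) read from a mapping of environment variables.

    The variable names are hypothetical; use whatever fits your deployment.
    """
    endpoint = env.get('NEWSCATCHER_ENDPOINT', '').rstrip('/')
    api_key = env.get('NEWSCATCHER_API_KEY', '')
    if not endpoint or not api_key:
        raise RuntimeError(
            'Set NEWSCATCHER_ENDPOINT and NEWSCATCHER_API_KEY first'
        )
    return endpoint, api_key


def build_headers(api_key):
    """Build the authentication header described in the setup list."""
    return {'x-api-token': api_key}


# Usage (not executed here, since it needs the variables to be set):
# NC_ENDPOINT, NC_API_KEY = load_config()
# headers = build_headers(NC_API_KEY)
```

Stripping the trailing slash from the endpoint keeps f-strings such as f'{NC_ENDPOINT}/api/subscription/' from producing a double slash.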
Getting Subscription Summary
NewsCatcher provides a subscription summary API endpoint to conveniently view the status of our subscription. This includes the number of API calls assigned and remaining, and which events are available to us, among other things. To call this endpoint, we need to send a GET request to /api/subscription/:
r = requests.get(
f'{NC_ENDPOINT}/api/subscription/',
headers={'x-api-token': NC_API_KEY},
)
print(json.dumps(r.json(), indent=2))
This code prints the below output resulting from the API call:
{
"active": true,
"calls_per_seconds": 5,
"plan_name": "events",
"usage_assigned_calls": 25000,
"usage_remaining_calls": 24988,
"additional_info": {
"allowed_events": [
"act_of_corruption"
]
}
}
In the output, we can see the various fields related to our subscription, such as whether it is active, the rate limit, and the number of calls assigned and remaining. We also have an additional_info field that lists the allowed_events we have access to, and it includes act_of_corruption. We’ll use this event type in the following steps to get the events from the API.
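Before sending search requests, it can be useful to check the summary programmatically. The following is a minimal sketch, assuming the response has the shape shown in the output above; check_subscription is a hypothetical helper of our own, and SAMPLE_SUMMARY simply copies that output as a Python dict.

```python
# Copy of the subscription summary shown above, as a Python dict.
SAMPLE_SUMMARY = {
    'active': True,
    'calls_per_seconds': 5,
    'plan_name': 'events',
    'usage_assigned_calls': 25000,
    'usage_remaining_calls': 24988,
    'additional_info': {'allowed_events': ['act_of_corruption']},
}


def check_subscription(summary, event_type):
    """Return (ok, reason) for whether event_type can be queried."""
    if not summary.get('active'):
        return False, 'subscription is not active'
    if summary.get('usage_remaining_calls', 0) <= 0:
        return False, 'no API calls remaining'
    allowed = summary.get('additional_info', {}).get('allowed_events', [])
    if event_type not in allowed:
        return False, f'event type {event_type!r} is not enabled'
    return True, 'ok'


ok, reason = check_subscription(SAMPLE_SUMMARY, 'act_of_corruption')
print(ok, reason)
```

In a real script, we would pass r.json() from the subscription call instead of the sample dict.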
Getting Event Fields
Before we get the events, we can use an endpoint that NewsCatcher provides to see which fields will be returned for each event. This will help us structure our event search API calls better. We can check which fields are available for our event type by sending a GET request to /api/events_info/get_event_fields. We’ll pass the event_type as a query parameter with the value act_of_corruption. Let’s look at the code:
r = requests.get(
f'{NC_ENDPOINT}/api/events_info/get_event_fields',
headers={'x-api-token': NC_API_KEY},
params={'event_type': 'act_of_corruption'},
)
print(json.dumps(r.json(), indent=2))
This gives us the following output:
{
"message": "Success",
"count": 16,
"fields": {
"act_of_corruption.accused_parties": {
"type": "String",
"usage_example": {
"act_of_corruption.accused_parties": "String Example"
}
},
...more fields...
"event_date": {
"type": "Date",
"usage_example": {
"event_date": {
"lte": "now",
"gte": "now-1d"
}
}
},
"extraction_date": {
"type": "Date",
"usage_example": {
"extraction_date": {
"lte": "now",
"gte": "now-1d"
}
}
}
}
}
From the above output, we see that 16 fields are available. These include various structured data fields such as event date, extraction date, monetary amounts involved, and accused and victim parties. We use these fields to search and filter events when we use the event search API. We also see some usage examples in the output, and for date fields, we can use convenient relative strings such as now and now-1d instead of providing exact date strings.
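To get a quick overview of what we can filter on, we can group the field names by their type. This is a small sketch that assumes the response shape shown above; fields_by_type is a hypothetical helper, and SAMPLE_FIELDS contains only the three fields visible in the truncated output, not all 16.

```python
# Only the fields visible in the truncated output above.
SAMPLE_FIELDS = {
    'message': 'Success',
    'count': 16,
    'fields': {
        'act_of_corruption.accused_parties': {'type': 'String'},
        'event_date': {'type': 'Date'},
        'extraction_date': {'type': 'Date'},
    },
}


def fields_by_type(response):
    """Group field names by the 'type' reported for each field."""
    grouped = {}
    for name, spec in response.get('fields', {}).items():
        grouped.setdefault(spec.get('type', 'Unknown'), []).append(name)
    return grouped


for field_type, names in fields_by_type(SAMPLE_FIELDS).items():
    print(field_type, names)
```

Date fields take a range with gte and lte keys, while String fields take a plain value, so knowing the type tells us which filter shape to write.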
Searching for Events
Finally, let’s look at how to search for events. For this, we’ll send POST requests to /api/events_search/, including our search parameters as the JSON body.
A Basic Query
r = requests.post(
f'{NC_ENDPOINT}/api/events_search/',
headers={'x-api-token': NC_API_KEY},
json={
'event_type': 'act_of_corruption',
'attach_articles_data': False,
'additional_filters': {
'extraction_date': {
'gte': 'now-6d',
'lte': 'now',
},
},
}
)
print(json.dumps(r.json(), indent=2))
The above code will get us a list of events from the API:
{
"message": "Success",
"count": 10000,
"events": [
{
"id": "000WkJIBvyT_ytpRkgh3",
"event_type": "act_of_corruption",
"global_event_type": "Corruption",
"associated_article_ids": [
"9bdc9053107df44cd7116b9be05bcdfb"
],
"extraction_date": "2024-10-15 17:40:50",
"event_date": null,
"company_name": null,
"act_of_corruption": {
"summary": "Democratic Party MP Jorida Tabaku criticizes the Albanian government for turning the country into a waste bin, highlighting corruption in waste management and incineration projects in Tirana, Fier, and Elbasan. She accuses the government of promoting incineration while ignoring EU policies on plastic bags and turning Albania into a waste bin of Europe. The Ministry of Tourism and Environment has initiated an initiative to import waste, and there are allegations of 28 million euros ending up in thieves' bank accounts.",
"accused_parties": [
"Albanian government",
"Thieves involved in waste management contracts"
],
"how_much_related": "Very Good",
"dominant_category": "corruption",
"democracy_category": [
"Political Financing and Influence"
],
"victim_parties": [
"Citizens of Albania"
],
"corruption_category": [
"Misappropriation"
],
"location": [
{
"country": "Albania",
"city": "Tirana",
"raw_location": "Tirana, Albania",
"county": "Tirana County",
"state": "Tirana"
},
...more locations
],
"industry": [
"Democracy",
"Other"
],
"monetary_value_currency": "EUR",
"monetary_value_amount": "28000000"
}
},
...more events
]
}
The above snippet shows truncated output with a single event data point, to illustrate the fields returned. Along with the event data points, the result gives us a message field with the value "Success", indicating that the API call was successful, and a count field showing the number of event data points returned.
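Each event carries many fields, so for analysis we may want to flatten it into a few values of interest. The sketch below assumes, based on the sample output above, that the event-specific details sit under a key named after the event_type and that monetary_value_amount arrives as a string. summarize_event is a hypothetical helper, and SAMPLE_EVENT is a trimmed copy of the event shown above.

```python
# Trimmed copy of the event shown in the output above.
SAMPLE_EVENT = {
    'id': '000WkJIBvyT_ytpRkgh3',
    'event_type': 'act_of_corruption',
    'act_of_corruption': {
        'accused_parties': [
            'Albanian government',
            'Thieves involved in waste management contracts',
        ],
        'location': [{'country': 'Albania', 'city': 'Tirana'}],
        'monetary_value_currency': 'EUR',
        'monetary_value_amount': '28000000',
    },
}


def summarize_event(event):
    """Flatten one event into a small dict of commonly used values."""
    # Assumption: details are nested under the event type's own name.
    details = event.get(event.get('event_type', '')) or {}
    raw_amount = details.get('monetary_value_amount')
    try:
        amount = int(raw_amount) if raw_amount is not None else None
    except (TypeError, ValueError):
        amount = None  # leave unparseable amounts empty rather than guess
    countries = sorted(
        {loc['country'] for loc in details.get('location', []) if loc.get('country')}
    )
    return {
        'id': event.get('id'),
        'accused': details.get('accused_parties', []),
        'countries': countries,
        'amount': amount,
        'currency': details.get('monetary_value_currency'),
    }


print(summarize_event(SAMPLE_EVENT))
```

With a real response, we could apply this to every item in r.json()['events'].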
Using Filters
We can use all the fields we saw in the earlier get_event_fields output as filters. Say we want to get only the events where the victim parties are “Citizens of India”. We can add the corresponding filter as follows:
r = requests.post(
f'{NC_ENDPOINT}/api/events_search/',
headers={'x-api-token': NC_API_KEY},
json={
'event_type': 'act_of_corruption',
'attach_articles_data': False,
'additional_filters': {
'extraction_date': {
'gte': 'now-6d',
'lte': 'now',
},
'act_of_corruption.victim_parties': 'Citizens of India',
},
}
)
print(json.dumps(r.json(), indent=2))
This gives us the below output:
{
"message": "Success",
"count": 7,
"events": [
{
"id": "AEyzj5IBvyT_ytpRHcUA",
"event_type": "act_of_corruption",
"global_event_type": "Corruption",
"associated_article_ids": [
"455197a19538328cfcc1546f19b1874e"
],
"extraction_date": "2024-10-15 15:52:12",
"event_date": "2023-08-15 00:00:00",
"company_name": null,
"act_of_corruption": {
"summary": "The Bharatmala Pariyojana, a road infrastructure project in India, is being built with significant deficiencies, non-compliance of outcome parameters, clear violation of tender bidding process, and huge funding mismanagement. The Atal Setu, a new trans-harbour link in Mumbai, built at a cost of Rs 18,000 crore, has become a mess of craters in its first monsoon, highlighting the poor quality of construction.",
"accused_parties": [
"Government of India",
"Bharatmala Pariyojana"
],
"how_much_related": "Good",
"dominant_category": "corruption",
"victim_parties": [
"Citizens of India"
],
"corruption_category": [
"Misappropriation"
],
"location": [
{
"country": "India",
"city": "Mumbai",
"raw_location": "Mumbai, Maharashtra, India",
"county": "",
"state": "Maharashtra"
}
],
"industry": [
"Other"
],
"monetary_value_currency": "INR",
"monetary_value_amount": "18000000000"
}
},
...more events
]
}
With this additional filter, we get only 7 data points where the victim parties are ‘Citizens of India’.
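Since the basic and filtered queries differ only in their additional_filters, we can build the JSON body with a small function. This is a sketch under the assumption that the payload keys are exactly those used in the snippets above; build_search_payload and its parameter names are our own.

```python
def build_search_payload(event_type, days_back=6, attach_articles=False,
                         field_filters=None):
    """Build the JSON body for a POST to /api/events_search/.

    field_filters maps field names (as listed by get_event_fields)
    to the values to match.
    """
    filters = {
        'extraction_date': {'gte': f'now-{days_back}d', 'lte': 'now'},
    }
    filters.update(field_filters or {})
    return {
        'event_type': event_type,
        'attach_articles_data': attach_articles,
        'additional_filters': filters,
    }


payload = build_search_payload(
    'act_of_corruption',
    field_filters={'act_of_corruption.victim_parties': 'Citizens of India'},
)
print(payload)
```

The resulting dict can be passed as the json argument of requests.post, in place of the inline dict used earlier. Field names are passed in a dict rather than as keyword arguments because they contain dots.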
Getting the News Sources
Any data that is cited or presented needs a source. So, NewsCatcher gives us an option to get the original article in which the event was detected. To use this option, we need to set attach_articles_data to true in our JSON payload (True in Python):
r = requests.post(
f'{NC_ENDPOINT}/api/events_search/',
headers={'x-api-token': NC_API_KEY},
json={
'event_type': 'act_of_corruption',
# set this option to True to attach the source articles
'attach_articles_data': True,
'additional_filters': {
'extraction_date': {
'gte': 'now-7d',
'lte': 'now'
},
'act_of_corruption.victim_parties': 'Citizens of India',
}
}
)
print(json.dumps(r.json(), indent=2))
The above code gives us the list of events with the source article data attached. Let’s see what a single event’s JSON looks like with this:
{
"id": "JEx8j5IBvyT_ytpR4ZWK",
"event_type": "act_of_corruption",
"global_event_type": "Corruption",
"associated_article_ids": [
"2757c6c80cdb80caf7032ddb06f031c9"
],
"extraction_date": "2024-10-15 14:52:58",
"event_date": "2024-02-15 00:00:00",
"company_name": null,
"act_of_corruption": {
"summary": "The controversy involves allegations against Union Finance Minister Nirmala Sitharaman and senior BJP leaders for orchestrating an extortion racket through the electoral bond scheme, amassing over \u20b98,000 crore in ill-gotten gains. An FIR was registered against them for extortion, criminal conspiracy, and common intention.",
"accused_parties": [
"Nirmala Sitharaman",
"Nalin Kumar Kateel",
"B.Y. Vijayendra"
],
"how_much_related": "Very Good",
"dominant_category": "corruption",
"democracy_category": [
"Elections"
],
"victim_parties": [
"Citizens of India"
],
"corruption_category": [
"Extortion"
],
"location": [
{
"country": "India",
"city": "Bengaluru",
"raw_location": "Special Court for People's Representatives in Bengaluru",
"county": "",
"state": "Karnataka"
}
],
"industry": [
"Democracy"
],
"monetary_value_currency": "INR",
"monetary_value_amount": "80000000000"
},
"articles": [
{
"link": "https://spoindia.org/electoral-bond-controversy-congress-demands-nirmala-sitharamans-resignation-over-fir",
"id": "2757c6c80cdb80caf7032ddb06f031c9",
"media": "https://spoindia.org/wp-content/uploads/2023/05/cropped-spo-32x32.jpg",
"title": "Electoral Bond Controversy: Congress Demands Nirmala Sitharaman's Resignation Over FIR"
}
]
}
We can see that the source article from which the event was extracted has been added to the output above. A link, cover media URL and a title have been provided for the article.
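To cite the sources, we can collect the article links from the returned events. The following sketch assumes each event has an articles list with link and title keys, as in the output above; collect_sources is a hypothetical helper, and SAMPLE_EVENTS is a trimmed copy of that event.

```python
# Trimmed copy of the event shown above, with its attached article.
SAMPLE_EVENTS = [
    {
        'id': 'JEx8j5IBvyT_ytpR4ZWK',
        'articles': [
            {
                'link': 'https://spoindia.org/electoral-bond-controversy-congress-demands-nirmala-sitharamans-resignation-over-fir',
                'id': '2757c6c80cdb80caf7032ddb06f031c9',
                'title': "Electoral Bond Controversy: Congress Demands Nirmala Sitharaman's Resignation Over FIR",
            }
        ],
    }
]


def collect_sources(events):
    """Return one entry per distinct article link, in order of appearance."""
    seen = set()
    sources = []
    for event in events:
        # 'articles' is absent when attach_articles_data was False.
        for article in event.get('articles') or []:
            link = article.get('link')
            if link and link not in seen:
                seen.add(link)
                sources.append({
                    'event_id': event.get('id'),
                    'title': article.get('title'),
                    'link': link,
                })
    return sources


for source in collect_sources(SAMPLE_EVENTS):
    print(source['title'], source['link'])
```

Deduplicating by link keeps the list short when several events were extracted from the same article.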
Conclusion
In this tutorial, we looked at how to use the NewsCatcher Events Intelligence API to detect events from news data. We did this in three steps:
- Used the GET: /api/subscription/ endpoint to see which types of events are enabled for us to access.
- Used the GET: /api/events_info/get_event_fields endpoint to see what fields are available for the selected event type, to use for filtering and searching events.
- Used the POST: /api/events_search/ endpoint to search for events with filters.
We also looked at how to use filters in the search and how to get the source articles from which the events were extracted.
The NewsCatcher Events Intelligence API is a handy feature that can be used to detect events of interest from thousands of news articles. The backend does all the heavy lifting of parsing the events from the articles, so you can directly proceed with your analysis of events. To get access to the API, visit the pricing page.