THE TECH DRIVING OUR API
Discover how our API processes data to deliver unmatched insights.
DISCOVER: HOW IT WORKS
Intelligent Scheduling Algorithm
Our process begins with a proprietary scheduling algorithm that tracks how often each source publishes over a week. These publication frequencies tell our crawlers when to revisit each source, so we gather new article links promptly without overwhelming system resources. The result is a balance between timeliness and resource utilization.
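As an illustration only, frequency-based scheduling can be sketched as follows. This is not our proprietary algorithm: the function name `crawl_interval`, the clamping bounds, and the simple "window divided by article count" rule are all assumptions made for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # observation window, matching the "over a week" description


def crawl_interval(publish_times, now,
                   min_interval=timedelta(minutes=5),
                   max_interval=timedelta(hours=24)):
    """Estimate how long to wait between crawls of one source.

    publish_times: datetimes at which the source published articles.
    The bounds are hypothetical; they keep very busy sources from being
    hammered and very quiet sources from being forgotten.
    """
    # Only publications inside the last week count toward the estimate.
    recent = [t for t in publish_times if now - WINDOW <= t <= now]
    if not recent:
        return max_interval
    # Average gap between publications over the window.
    avg_gap = WINDOW / len(recent)
    # Clamp the gap into the allowed range.
    return max(min_interval, min(max_interval, avg_gap))
```

A source that published 168 articles in the past week would be revisited roughly hourly under this rule, while a source with no recent articles falls back to the daily maximum.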
Data Acquisition
We fetch and store the raw webpage for each article link. Because the original page is archived, we can re-run improved extraction methods on past articles as new techniques become available, so data quality keeps improving over time.
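A minimal sketch of this kind of raw-page archive is shown below. The directory layout, the gzip compression, and the helper names `archive_raw_page` and `load_raw_page` are assumptions for the example, not a description of our storage backend.

```python
import gzip
import hashlib
from pathlib import Path


def archive_raw_page(url, html, root):
    """Store the raw HTML for a URL under a content-independent key.

    The key is a hash of the URL, so re-archiving the same link
    overwrites the same file instead of creating a duplicate.
    """
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Shard by the first two hex characters to avoid huge directories.
    path = Path(root) / key[:2] / f"{key}.html.gz"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(gzip.compress(html.encode("utf-8")))
    return path


def load_raw_page(path):
    """Read an archived page back so a newer extractor can be re-run on it."""
    return gzip.decompress(Path(path).read_bytes()).decode("utf-8")
```

Keeping the untouched HTML, rather than only the extracted fields, is what makes retroactive re-extraction possible.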
Extraction Techniques
We use five distinct extraction methods to retrieve article data: two advanced adaptations of open-source technologies and three proprietary techniques developed in-house. This diverse toolkit lets us handle a wide range of article formats and data types.
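One common way to combine several extraction methods is a fallback chain: try each method in turn and keep the first result that passes a quality check. The sketch below assumes that design and uses two toy regex-based extractors as stand-ins; it does not reproduce any of our five methods, and every function name in it is hypothetical.

```python
import re


def strip_tags(fragment):
    """Replace HTML tags with spaces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", fragment)).strip()


def extract_from_article_tag(html):
    """Toy extractor: use the contents of the first <article> element."""
    match = re.search(r"<article[^>]*>(.*?)</article>", html, re.S | re.I)
    return strip_tags(match.group(1)) if match else None


def extract_from_paragraphs(html):
    """Toy extractor: join the text of every <p> element."""
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, re.S | re.I)
    text = " ".join(strip_tags(p) for p in paragraphs)
    return text or None


def extract_text(html, extractors, min_chars=20):
    """Return (method_name, text) from the first extractor whose output
    is at least min_chars long, or (None, None) if all of them fail."""
    for extractor in extractors:
        text = extractor(html)
        if text and len(text) >= min_chars:
            return extractor.__name__, text
    return None, None
```

Recording which method produced the text makes it easier to see which page layouts each method handles well.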
Data Integration & Deduplication
After extraction, data from different sources is integrated into a unified article format. Our system then deduplicates articles using a combination of the URL and an internally generated ID based on various data points, so that each article is unique and consistently formatted. The extraction process focuses particularly on the accuracy of the full article text, publish dates, and author details.
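The combination of URL and generated ID described above can be sketched as follows. The normalization rules, the choice of title, publish date, and leading text as the "data points", and the field names are assumptions for this example; our internal ID is built differently.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment, trailing slash,
    and utm_* tracking parameters so equivalent links compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if not k.lower().startswith("utm_")]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(sorted(query)), ""))


def article_id(title, published, text):
    """Hypothetical content ID: a hash over a few normalized fields."""
    basis = "|".join([
        " ".join(title.lower().split()),
        published,
        " ".join(text.lower().split())[:500],
    ])
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]


def deduplicate(articles):
    """Keep the first article seen for each normalized URL or content ID."""
    seen_urls, seen_ids, unique = set(), set(), []
    for art in articles:
        url = normalize_url(art["url"])
        aid = article_id(art["title"], art["published"], art["text"])
        if url in seen_urls or aid in seen_ids:
            continue
        seen_urls.add(url)
        seen_ids.add(aid)
        unique.append({**art, "url": url, "id": aid})
    return unique
```

Checking both keys catches two different cases: the same link shared with different tracking parameters, and the same story republished under a different URL.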
Data Cleaning
The next phase is a comprehensive data cleaning process. We use a detailed directory of patterns to identify and remove irrelevant information from each article, which improves the quality of the text passed on to later stages.
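Pattern-based cleaning of this kind can be sketched with a small list of regular expressions. The four patterns below are invented examples of typical page residue, not entries from our actual pattern directory.

```python
import re

# Hypothetical examples of boilerplate lines found in article text.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*(advertisement|sponsored content)\s*$", re.I),
    re.compile(r"^\s*(subscribe to|sign up for) our newsletter.*$", re.I),
    re.compile(r"^\s*(read more|related):.*$", re.I),
    re.compile(r"^\s*share (this|on) .*$", re.I),
]


def clean_text(text):
    """Drop blank lines and any line that matches a boilerplate pattern."""
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not any(p.match(line) for p in BOILERPLATE_PATTERNS)
    ]
    return "\n".join(kept)
```

Keeping the patterns in one list makes it simple to add a new rule whenever a source introduces a new kind of residue.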
NLP Pipeline
Cleaned articles are processed through an advanced Natural Language Processing (NLP) pipeline. This stage includes summarizing the content, classifying articles into broad news topics, detecting named entities, and assessing sentiment. This enriches the articles, making them more actionable and insightful for users.
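To show the shape of the enriched output, here is a deliberately simplified sketch of the four steps. Every technique in it is a toy stand-in chosen for the example (first sentence as summary, keyword overlap for topics, capitalized word runs for entities, a tiny word list for sentiment); the production pipeline does not work this way, and the names and word lists are hypothetical.

```python
import re

TOPIC_KEYWORDS = {
    "business": {"market", "stocks", "earnings"},
    "sports": {"match", "team", "season"},
}
POSITIVE = {"gain", "growth", "record", "win"}
NEGATIVE = {"loss", "decline", "crisis", "fall"}


def enrich(text):
    """Return a dict with summary, topic, entities, and sentiment."""
    words = re.findall(r"[a-z]+", text.lower())
    vocab = set(words)

    # Summary: the first sentence of the article.
    summary = re.split(r"(?<=[.!?])\s+", text.strip())[0]

    # Topic: the label whose keyword set overlaps the article most.
    scores = {t: len(kw & vocab) for t, kw in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    topic = best if scores[best] > 0 else "other"

    # Entities: runs of two or more capitalized words.
    entities = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)

    # Sentiment: positive word count minus negative word count.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"

    return {"summary": summary, "topic": topic,
            "entities": entities, "sentiment": sentiment}
```

The point of the sketch is the output record: each cleaned article gains a summary, a topic label, a list of entities, and a sentiment value that users can filter on.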
Indexing & Distribution
Processed articles are indexed in our main production ES clusters for querying. We also distribute specific datasets to dedicated client clusters and shared cloud storage to ensure high availability and performance.
Query Processing
Our system dynamically filters and groups articles based on user queries, clustering similar articles together so that results are highly relevant and returned quickly.
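Filtering and grouping can be illustrated with a small in-memory sketch. It assumes a keyword filter followed by greedy clustering on Jaccard similarity of title words; the similarity measure, the threshold, and the function names are assumptions for the example and not the algorithms used in our API.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_and_cluster(articles, query, threshold=0.5):
    """Keep articles containing every query word, then group articles
    whose titles are similar to the first article of a cluster."""
    wanted = tokens(query)
    matches = [a for a in articles
               if wanted <= tokens(a["title"] + " " + a["text"])]

    clusters = []
    for art in matches:
        title_tokens = tokens(art["title"])
        for cluster in clusters:
            if jaccard(title_tokens, tokens(cluster[0]["title"])) >= threshold:
                cluster.append(art)
                break
        else:
            # No existing cluster was similar enough; start a new one.
            clusters.append([art])
    return clusters
```

Grouping near-identical headlines means a user searching for a story sees one cluster per event rather than a long list of repeats.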
Custom Solutions
We continuously develop custom solutions tailored to the needs of our clients. This bespoke service is part of our commitment to delivering exceptional value and adapting to the specific challenges our users face. Here are some that we have built already.
Trusted by the Top Leaders
OUR SOLUTIONS IN ACTION
READY FOR CUSTOM NEWS SOLUTIONS?
Drop your email and find out how our API delivers precisely what your business needs.