Text Mining Platform

Case Study

Main Idea

The idea was to create a pipeline that pulls articles from the Internet and exposes the knowledge extracted from them. The pipeline consists of steps, or components. The extracted knowledge is saved to a database and organized so that it can be reused for any purpose. Each component consumes data items of a certain type and produces data items of a certain type.
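The typed-component idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's actual code: the `Component` class, the `run_pipeline` helper, and the two toy components are hypothetical, and the toy crawler fabricates a page instead of fetching one.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Component:
    """One pipeline step (hypothetical structure, for illustration only)."""
    name: str
    consumes: str              # type label of the input data items
    produces: str              # type label of the output data items
    run: Callable[[Any], Any]  # the processing function itself


def run_pipeline(components: List[Component], item: Any, item_type: str) -> Any:
    """Pass one data item through the components, checking that types line up."""
    for component in components:
        if component.consumes != item_type:
            raise TypeError(
                f"{component.name} consumes {component.consumes!r}, got {item_type!r}"
            )
        item = component.run(item)
        item_type = component.produces
    return item


# Toy stand-ins: a real crawler would fetch the page over HTTP.
crawler = Component(
    "Crawler", "URLs", "HTML pages",
    lambda url: f"<html><body>Text from {url}</body></html>",
)
text_extractor = Component(
    "Text extractor", "HTML pages", "Article texts",
    lambda html: html.replace("<html><body>", "").replace("</body></html>", ""),
)

article = run_pipeline([crawler, text_extractor], "example.org/news", "URLs")
```

Declaring the consumed and produced types on each component lets a wiring mistake, such as running the text extractor before the crawler, fail immediately instead of producing bad data further down the pipeline.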

Main modules

1. Retrieval

Performs periodic and/or real-time retrieval of data from external data sources, using available APIs as well as HTML scraping techniques.

2. Processing

Analyses, categorises, and stores the retrieved data in the manner most efficient for presentation to end users.

3. Presentation API

A set of RESTful services that allow the frontend to consume the available data on a per-user basis.

 

How we approached the project

After the facts and events are extracted, an event stream is created for each user and passed to that user's UI. Every component of the pipeline persists its results.

  • The pipeline components consume and produce the following types of data items:

    Seed phrases;

    URLs;

    HTML pages;

    Article texts;

    Named entities;

    Temporal expressions;

    Relations;

    Sentence boundaries;

    Themes;

    Objects;

    Facts/Events;

    Event stream.

  • The system consists of the following components:

    Search engine;

    Crawler;

    Text extractor;

    Named entity recognizer;

    Temporal expressions extractor;

    Relations extractor;

    Sentence detector;

    Theme detector;

    Named entity disambiguation;

    Fact extractor;

    Event merger;

    Stream builder.

  • The system relies on linguistic libraries that can produce errors, so their output should be monitored and corrected where necessary. Where possible, the libraries' internal algorithms should also be corrected to avoid the same errors in the future.

    Each component provides a confidence estimate for each output data item. Based on this estimate, the data item can be passed to the curation UI before it proceeds to the next pipeline stage.

  • There are several options for a curator:

    If the curator approves the data item, it is passed to the next component;

    If the curator corrects the data item, adjustments are made to the processing component and the data item is passed to that component again. The curator can then review the reprocessed data items and approve them if the level of confidence is acceptable;

    The curator can discard an output data item that is completely incorrect, in which case it is not passed to the next component of the pipeline.
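The confidence check and the three curator options can be sketched as a small routing function. This is an illustration under assumptions, not the platform's implementation: the names `DataItem`, `Route`, `Decision`, `route_output`, and `apply_decision` are hypothetical, confidence is assumed to be a number between 0 and 1, and the 0.8 threshold is an arbitrary example value.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Route(Enum):
    NEXT_COMPONENT = "next component"
    CURATION_UI = "curation UI"
    DROPPED = "dropped"


class Decision(Enum):
    APPROVE = "approve"
    CORRECT = "correct"
    DISCARD = "discard"


@dataclass
class DataItem:
    value: str
    confidence: float  # estimate reported by the producing component (assumed 0..1)


CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, not taken from the case study


def route_output(item: DataItem, threshold: float = CONFIDENCE_THRESHOLD) -> Route:
    """Send low-confidence items to the curator, the rest to the next stage."""
    if item.confidence < threshold:
        return Route.CURATION_UI
    return Route.NEXT_COMPONENT


def apply_decision(
    item: DataItem,
    decision: Decision,
    reprocess: Callable[[DataItem], DataItem],
) -> Tuple[Route, Optional[DataItem]]:
    """Apply a curator decision and report where the item goes next."""
    if decision is Decision.APPROVE:
        return Route.NEXT_COMPONENT, item
    if decision is Decision.CORRECT:
        # The component is assumed to have been adjusted already; the item is
        # run through it again and shown to the curator for another review.
        return Route.CURATION_UI, reprocess(item)
    return Route.DROPPED, None  # discarded items never reach the next stage


# Example: an ambiguous entity is flagged, corrected, and reprocessed.
uncertain = DataItem("Paris", confidence=0.55)
first_route = route_output(uncertain)
second_route, reprocessed = apply_decision(
    uncertain,
    Decision.CORRECT,
    reprocess=lambda it: DataItem(it.value, confidence=0.9),
)
```

In this sketch a corrected item always returns to the curation UI, matching the description that the curator reviews reprocessed items before approving them.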
