Information Extraction using Machine Learning

Anushka N Sharma
AITS Journal
3 min read · Jul 25, 2019


With the amount of information available on the internet growing at a phenomenal rate, research into improving the effectiveness and efficiency of information extraction and knowledge discovery has become crucial.

“Without data you are just another person with an opinion”

- W. Edwards Deming

Information extraction (IE) is concerned with applying natural language processing to automatically extract the essential details from text documents.

The process of information extraction turns the unstructured information embedded in texts into structured data, for example to populate a relational database for further processing. It takes unseen texts as input and produces fixed-format, unambiguous data as output.
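As a minimal sketch of this text-to-table idea, the snippet below pulls a fixed-format record out of free text and loads it into an in-memory SQLite table. The sentence pattern ("X was founded in YEAR"), the sample text, the table layout and the function names are all assumptions made for illustration; a real IE system would not rely on one hand-written pattern.

```python
import re
import sqlite3

# Hypothetical pattern (an assumption for this sketch):
# "<Capitalized Name> was founded in <4-digit year>"
FOUNDED = re.compile(
    r"([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) was founded in (\d{4})"
)

def extract_founding_facts(text):
    """Return (company, year) tuples found in free text."""
    return [(m.group(1), int(m.group(2))) for m in FOUNDED.finditer(text)]

def load_into_db(rows):
    """Store the extracted tuples in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT, founded INTEGER)")
    conn.executemany("INSERT INTO company VALUES (?, ?)", rows)
    return conn

# Made-up input text, used only to exercise the sketch.
text = ("Acme Corp was founded in 1947. "
        "Years later, Globex was founded in 1989 by a rival.")
rows = extract_founding_facts(text)
conn = load_into_db(rows)
result = conn.execute(
    "SELECT name, founded FROM company ORDER BY founded"
).fetchall()
```

Once the facts sit in a table, ordinary SQL queries can take over, which is the "further processing" that structured output makes possible.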

“I am the very model of a modern Major-General, I’ve information vegetable, animal, and mineral, I know the kings of England, and I quote the fights historical From Marathon to Waterloo, in order categorical…”

- Gilbert and Sullivan, Pirates of Penzance

Some of the tasks in information extraction are named entity recognition, relation extraction, and event extraction. Let us look at each of them in turn.

We begin with the first step in most IE tasks: finding the proper names, or named entities, in a text. The task of named entity recognition (NER) is to find each mention of a named entity in the text and label its type. Once all the named entities in a text have been extracted, they can be linked together in sets corresponding to real-world entities.
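To make "find each mention and label its type" concrete, here is a toy recognizer based on a lookup list (a gazetteer). This is only one simple way to approach NER, chosen for brevity; the gazetteer entries, the type labels and the function name are assumptions for this sketch, and practical systems usually learn to tag entities from annotated data.

```python
import re

# Hypothetical gazetteer: a tiny lookup of known names and their types.
GAZETTEER = {
    "Marie Curie": "PERSON",
    "Paris": "LOCATION",
    "Sorbonne": "ORGANIZATION",
}

def recognize_entities(text):
    """Return (mention, type, start_offset) for each gazetteer match, in text order."""
    mentions = []
    for name, label in GAZETTEER.items():
        # Word boundaries keep "Paris" from matching inside a longer word.
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            mentions.append((m.group(0), label, m.start()))
    return sorted(mentions, key=lambda item: item[2])

text = "Marie Curie taught at the Sorbonne in Paris."
entities = recognize_entities(text)
```

A lookup list cannot label names it has never seen, which is one reason statistical taggers are preferred once training data is available.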

Next, we turn to the task of relation extraction: finding and classifying semantic relations among the text entities. These are often binary relations like child-of, employment, part-whole, and geospatial relations. Relation extraction has close links to populating a relational database.
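A rough sketch of binary relation extraction with hand-written surface patterns follows. The three patterns, the sample sentences and the function name are assumptions for illustration; each pattern stands in for one relation type and yields an (entity, relation, entity) triple of the kind that could fill a row in a relational table.

```python
import re

# A naive stand-in for an entity mention: one or more capitalized words.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Hypothetical surface patterns, one per relation type.
PATTERNS = {
    "child-of": re.compile(
        rf"(?P<a>{NAME}) is the (?:son|daughter) of (?P<b>{NAME})"),
    "employment": re.compile(rf"(?P<a>{NAME}) works for (?P<b>{NAME})"),
    "part-whole": re.compile(rf"(?P<a>{NAME}) is part of (?P<b>{NAME})"),
}

def extract_relations(text):
    """Return (arg1, relation, arg2) triples in the order they appear."""
    found = []
    for relation, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), (m.group("a"), relation, m.group("b"))))
    return [triple for _, triple in sorted(found)]

text = ("Ada Lovelace is the daughter of Lord Byron. "
        "Grace Hopper works for Remington Rand. "
        "Brooklyn is part of New York.")
triples = extract_relations(text)
```

Patterns like these are brittle; the same relation phrased differently ("is employed by", "joined") would be missed, which is why relation classification is often treated as a learning problem.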

Finally, we turn to event extraction, which means finding the events in which these entities participate. Event coreference is needed to figure out which event mentions in a text refer to the same event. Three tasks related to event extraction are temporal expression recognition, temporal normalization, and template filling.
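The two temporal tasks can be illustrated together: first recognize a date expression in the text, then normalize it to a standard form. The sketch below assumes only two English date layouts ("July 25, 2019" and "25 July 2019") and normalizes them to ISO 8601; the regular expression and function name are illustrative assumptions, and relative expressions such as "last week" are out of scope here.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}
MONTH_RE = "|".join(MONTHS)

# Two assumed layouts: "Month D, YYYY" and "D Month YYYY".
DATE_RE = re.compile(
    rf"\b(?:(?P<m1>{MONTH_RE}) (?P<d1>\d{{1,2}}), (?P<y1>\d{{4}})"
    rf"|(?P<d2>\d{{1,2}}) (?P<m2>{MONTH_RE}) (?P<y2>\d{{4}}))\b"
)

def normalize_dates(text):
    """Return (surface_expression, ISO 8601 date) pairs in text order."""
    results = []
    for m in DATE_RE.finditer(text):
        month = MONTHS[m.group("m1") or m.group("m2")]
        day = int(m.group("d1") or m.group("d2"))
        year = int(m.group("y1") or m.group("y2"))
        # date() raises ValueError for impossible dates such as February 31.
        results.append((m.group(0), date(year, month, day).isoformat()))
    return results

text = ("The article appeared on July 25, 2019, "
        "and the workshop was held on 3 March 2020.")
normalized = normalize_dates(text)
```

Normalizing both mentions to the same format is what lets events be placed on a common timeline, whatever wording the text used.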

Much of the work in information extraction trains models on a bag-of-words representation. The standard vector space model of text represents a document as a sparse vector that specifies a weighted frequency for each of the large number of distinct words or tokens that appear in a corpus. Other work uses sequence models such as the Hidden Markov Model (HMM) or the Conditional Random Field (CRF). Some IE systems treat the text as a sequence of uninterpreted tokens and apply pattern matching or rule-based techniques to retrieve the information.
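The sparse weighted vector described above can be sketched in a few lines. The weighting used here is TF-IDF (raw term count times the log of inverse document frequency), which is one common choice and an assumption on our part, since the text does not say which weighting is meant. The tokenizer, the function names and the three toy documents are also illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    weight = term count in the document * log(N / document frequency).
    Terms that occur in every document get weight 0 and are left out,
    which keeps the vectors sparse.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
            if doc_freq[term] < n_docs
        })
    return vectors

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vectors = tfidf_vectors(docs)
```

Storing each document as a dictionary of its non-zero terms mirrors the sparsity of the model: a corpus may contain many thousands of distinct words, but any single document uses only a few of them.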

An IE application analyses texts and presents only the specific information from them that the user is interested in. IE systems are difficult and knowledge-intensive to build, and they are tied to particular domains and scenarios. Because of this, IE in text mining is always limited to a particular corpus. A major disadvantage of current approaches is their intrinsic dependence on the application domain and the target language. Several machine learning techniques have been applied to make information extraction systems more portable.

Because information extraction is limited to a particular corpus, its usage is limited as well. Also, when extracting the aim, methodology and conclusion of a document using features, the features used for identifying the aim and the conclusion are the same across domains, whereas the features for the methodology vary from domain to domain. This work can be extended by considering different interrelated topics.
