Ph.D. Theses

Joint Information Extraction

By Qi Li
Advisor: Heng Ji
April 7, 2015

Information extraction (IE) is a challenging and essential task in natural language processing (NLP), with applications including question answering, conversational language understanding, machine translation, and many more. It aims to automatically identify important entity mentions and their interactions, such as relations and events, in unstructured documents. In the past decade, researchers have made significant progress in this area. Many IE approaches employ a pipeline of independent components, yet dependencies across components, documents, and languages are pervasive. Ignoring these dependencies leads to inferior performance: the local classifiers do not share information with one another to produce coherent results and, more importantly, they cannot perform global inference. It is therefore critical to devise cross-component, cross-document, and cross-lingual joint modeling methods to further improve the performance of IE.

Focusing on entity mention extraction, relation extraction, and event extraction, the main part of this thesis presents a novel sentence-level joint IE framework based on structured prediction and inexact search. In this framework, the three types of IE output are extracted simultaneously, which alleviates the error propagation problem, and various global features can be used to produce more accurate and coherent results. Experimental results on the ACE corpora show that our joint model achieves state-of-the-art performance on each stage of the extraction. We then go beyond the sentence level and improve performance in the cross-document setting: we use an integer linear programming (ILP) formulation to conduct cross-document inference, so that many spurious results can be filtered out based on the inter-dependencies among facts extracted from different places. Finally, to investigate cross-lingual dependencies, we present a CRF-based joint bilingual name tagger for parallel corpora, and then demonstrate the application of this method to enhance name-aware machine translation.
