Open Information Extraction: Current Approaches

How can a computer accumulate a massive body of knowledge to support Web search engines of the future?

REGISTRATION CLOSED - NO PLACES LEFT

To address this question, Open IE (IE = Information Extraction) projects have developed extraction systems that read arbitrary text from any domain on the Web. The goal of these systems is to extract meaningful information and store it in unified knowledge bases for efficient querying. In contrast to traditional information extraction, the Open Information Extraction paradigm attempts to overcome the knowledge acquisition bottleneck by extracting a large number of relations at once.

Course description

  • The seminar is open to both bachelor and master students.
  • The full list of topics will be online by 14.09. Each topic will focus on a paper that can be used as a starting point for further research.
  • The first meeting takes place in the week of 15.09-19.09 (the exact date will be fixed at the beginning of the lecture period).
  • To pass the seminar, each student has to write a seminar paper that describes the topic, relevant approaches, experiments, and results (master students: 12-15 pages; bachelor students: 10-12 pages).
  • A presentation will be given at the end of the semester in a block seminar (the exact date will be fixed in the first meeting).
  • Both the seminar paper and the presentation have to be written/given in English.
  • We will provide a LaTeX template for writing the seminar paper.

     

Participation

To participate in the course, please write an email to Dr. Christian Meilicke.

Seminar Topics

Below you will find a list of possible topics for the seminar paper and presentation (including the "starter" paper). We also welcome proposals from students.

Bachelor topics

  • Title: Foundations of Open Information Extraction.
    Content: What is IE? What is OIE? Clarify the differences and give precise definitions. Give an overview of the different existing OIE systems/approaches and link them to the groups that are working on these issues. What is the difference between the first and the second generation of OIE? The resulting paper should be a survey paper that gives a good overview of the topic.
    Recommended Paper: Michele Banko et al.: Open Information Extraction from the Web. IJCAI, 2007.
  • Title: Identifying Knowledge in OIE Results
    Content: In most cases, OIE systems generate triples such as "arnold governor california" that are not semantified, i.e., it is unclear which entity "arnold" refers to. By mapping these triples (and their parts) to knowledge bases such as DBpedia, semantics (= meaning) are attached to the triples. One can call this process knowledge identification. The seminar paper that deals with this topic should clarify the challenge of knowledge identification and report on some approaches that are used to solve it. One of these approaches should be explained in more detail.
    Recommended Paper: Jay Pujara et al.: Knowledge Graph Identification. ISWC, 2013
  • Title: Never-Ending Language Learning (NELL)
    Content: "Read the Web" is a research project that attempts to create a computer system that learns over time to read the web. The system, also known as NELL, has created a large corpus of triples. This seminar paper should present the main approach underlying NELL and some of the extensions that have been added in recent years.
    Link to Papers: http://rtw.ml.cmu.edu/rtw/publications
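
As a small illustration of the knowledge identification task described under "Identifying Knowledge in OIE Results", the following Python sketch maps the surface triple "arnold governor california" to knowledge base identifiers. It is only a toy under stated assumptions: the candidate table, the entity types, the relation signature, and all "kb:" identifiers are hypothetical and hand-written for this example, and the simple type filter is not the method of the recommended paper.

```python
from itertools import product

# Hypothetical candidate identifiers for each surface string.
# "arnold" is ambiguous on purpose.
CANDIDATES = {
    "arnold": ["kb:Arnold_Palmer", "kb:Arnold_Schwarzenegger"],
    "governor": ["kb:governorOf"],
    "california": ["kb:California"],
}

# Hypothetical type of each candidate entity.
ENTITY_TYPES = {
    "kb:Arnold_Palmer": "Golfer",
    "kb:Arnold_Schwarzenegger": "Politician",
    "kb:California": "State",
}

# Hypothetical signature: relation -> (subject type, object type).
RELATION_SIGNATURE = {
    "kb:governorOf": ("Politician", "State"),
}


def identify(triple):
    """Return all candidate mappings of a surface triple that satisfy
    the type signature of the mapped relation."""
    subj, pred, obj = triple.split()
    matches = []
    for s, p, o in product(
        CANDIDATES.get(subj, []),
        CANDIDATES.get(pred, []),
        CANDIDATES.get(obj, []),
    ):
        # Keep a combination only if subject and object types fit the relation.
        if RELATION_SIGNATURE.get(p) == (ENTITY_TYPES.get(s), ENTITY_TYPES.get(o)):
            matches.append((s, p, o))
    return matches


print(identify("arnold governor california"))
```

In this toy setting the type constraint alone resolves the ambiguity of "arnold". Real systems have to cope with noisy extractions, missing types, and far larger candidate sets.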

Master or Bachelor Topic

  • Title: Wikipedia and Open Information Extraction
    Content: Wikipedia (or DBpedia) is used in many different ways (different goals, different techniques) in the context of OIE. This paper should give an overview of the different ways in which Wikipedia/DBpedia is used in the context of OIE. The first step is to read those papers that mention both OIE and Wikipedia/DBpedia. The second step is to classify the different tasks that are defined in that context and the different methods that are applied.
    Starting Paper: http://homes.cs.washington.edu/~weld/papers/wu-acl10.pdf

Master topics

  • Title: The Use of Markov Logic for Open Information Extraction
    Content: Markov Logic has been proposed as a technique for improving Open Information Extraction. This seminar paper should explain both the basic concepts of Markov Logic and its use in the context of OIE.
    Recommended Paper: Hoifung Poon, Pedro Domingos: Joint Inference in Information Extraction. AAAI, 2007
  • Title: Large-scale information extraction and Deep Learning
    Content: Recently, there has been a lot of interest in the NLP and Machine Learning communities in so-called "deep learning" architectures, namely approaches that are able to learn complex non-linear relationships on the basis of deep (e.g., recursive) neural network models. This seminar paper will focus on the application of deep learning methods to large-scale information extraction.
    Recommended Paper: Socher et al.: Reasoning With Neural Tensor Networks For Knowledge Base Completion. NIPS 2013
  • Title: Incompleteness in large scale Knowledge Bases
    Content: Knowledge bases created from web facts are often plagued by inconsistencies resulting from extraction errors, ambiguity of terms, incorrect facts, and more. These problems often leave the knowledge bases incomplete as well. Markov Logic Networks make it possible to detect such errors using first-order logic rules. In this seminar paper, we will explore how to formalize such errors through Horn clauses and how to detect and eliminate them in large-scale knowledge bases.
    Recommended Paper: Chen et al.: Knowledge Expansion over Probabilistic Knowledge Bases. ICDM, 2014
  • Title: Information extraction with Matrix Factorization and Universal Schema
    Content: Recently, there have been proposals to populate knowledge bases with so-called universal schema - namely the union of all available schemas (surface form predicates as in OpenIE, and relations in the schemas of pre-existing databases) using matrix factorization methods. This seminar paper will look at ways to apply matrix factorization methods for wide-coverage information extraction.
    Recommended Paper: Riedel et al.: Relation Extraction with Matrix Factorization and Universal Schemas. HLT-NAACL, 2013.

  • Title: Unsupervised Learning of Word Representations in the Vector Space
    Content: Recently, methods have been proposed for learning vector space representations of words from large text corpora that capture the relationships between words surprisingly well. For example, the male/female relationship is learned automatically, and thus the vector operation “King - Man + Woman” results in a vector very close to “Queen.”
    Recommended Papers: 1) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013 | 2) Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, 2014
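
The “King - Man + Woman” operation mentioned in the last topic can be tried out with a minimal Python sketch. The three-dimensional vectors below are hand-made, hypothetical values chosen only to make the arithmetic visible. They are not learned embeddings, and the function names are assumptions of this example rather than part of any library.

```python
import math

# Hand-made toy vectors (hypothetical values, not learned from text).
# The dimensions can loosely be read as: royalty, maleness, femaleness.
TOY_VECTORS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c),
    excluding the three input words themselves."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))
```

With these toy values the nearest remaining word is "queen". With vectors learned from a real corpus the result is only approximately equal to the target vector, which is why a nearest-neighbour search is used.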