CS 715: Large-Scale Data Integration Seminar (FSS2018)

This seminar covers topics related to integrating data from large numbers of independent data sources. This includes large-scale schema matching, identity resolution, data fusion, set completion, data search, and data profiling.



In this seminar, you will
  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report (10-12 pages)
  • Give a presentation about your topic (before the submission of the report)

Requirements

  • Attending the courses Web Data Integration and Data Mining I before the seminar is recommended
  • The report has to be written using LaTeX
  • Report and presentation have to be in English

Organization

  1. Select your preferred topics and register before February 16th
  2. Attend the kickoff meeting on February 27th at 10:00 in room B6 C1.01 (library room). Slides of the kickoff meeting are available.
  3. You will be assigned a mentor, who provides guidance in one-to-one meetings
  4. Work individually throughout the semester: explore literature, create a presentation, and write a report
  5. Give your presentation in a block seminar towards the end of the semester

Getting started

The following books are good starting points for getting an overview of the topic of large-scale data integration:

  • Dong/Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  • Doan/Halevy: Principles of Data Integration. Morgan Kaufmann, 2012.

Topics

Explore the list of topics below and select at least 3 topics of your preference. Send a ranked list of your selected topics via email to anna(at)informatik.uni-mannheim.de by February 16, 2018. We will confirm your registration and, if possible, assign you one of your preferred topics in the week of February 19, 2018.

  1. Collective Instance Matching
    • Doan/Halevy: Principles of Data Integration. Pages 198ff, Morgan Kaufmann, 2012.
    • Christophides/Efthymiou/Stefanidis: Entity resolution in the web of data. Pages 55-72, Synthesis Lectures on the Semantic Web, 2015.
  2. Holistic Schema Matching
    • Mulwad, Varish, Tim Finin, and Anupam Joshi: Semantic message passing for generating linked data from tables. International Semantic Web Conference. Springer, Berlin, Heidelberg, 2013.
    • He, Yeye, et al.: Automatic discovery of attribute synonyms using query logs and table corpora. Proceedings of the 25th International Conference on World Wide Web, 2016.
  3. Data Search for Table Extension
    • Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012.
    • Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013.
  4. Truth Discovery for Knowledge Base Completion
    • Li, Gao, et al.: A Survey on Truth Discovery. ACM SIGKDD Explorations Newsletter, 2015.
    • Dong, et al.: Knowledge-based trust: estimating the trustworthiness of web sources. VLDB 2015.
  5. Wrapper Induction for Knowledge Base Completion
    • Bühmann, L., et al.: Web-Scale Extension of RDF Knowledge Bases from Templated Websites. Proceedings of the International Semantic Web Conference (ISWC), 2014.
    • Furche, Tim, et al.: DIADEM: domain-centric, intelligent, automated data extraction methodology. Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.
  6. Set Completion using Semi-Structured Web Data
    • Wang and Cohen: Language-independent set expansion of named entities using the web. Seventh IEEE International Conference on Data Mining (ICDM), 2007.
    • Lidong Bing, Wai Lam, and Tak-Lam Wong: Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM), 2013.
  7. Query Strategies for Active Learning
    • Settles, Burr: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1-114.
    • Isele, Robert, Anja Jentzsch, and Christian Bizer: Active learning of expressive linkage rules for the web of data. Proceedings of the International Conference on Web Engineering, 2012.
  8. Profiling schema.org JobPosting Data on the Web
    • Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC), 2014.
    • Anna Primpeli: WebDataCommons Schema.org Data Extracted from November 2017 Common Crawl.
  9. Corporate Data Lakes
    • Alon Halevy, Flip Korn, Natalya F. Noy, et al.: Goods: Organizing Google's Datasets. SIGMOD, 2016.
    • I. Terrizzano, P. M. Schwarz, et al.: Data wrangling: The challenging journey from the wild to the lake. CIDR, 2015.
  10. Dataspace Profiling
    • Felix Naumann: Data profiling revisited. SIGMOD Rec. 42, 4, 2014.
    • Mohamed Ellefi, et al.: RDF Dataset Profiling - a Survey of Features, Methods, Vocabularies and Applications. Journal of Web Semantics, 2017.
 Students are free to suggest additional topics of their choice that are related to large-scale data integration.

Presentation Schedule

The final seminar presentations will take place on Friday, 04.05, and Monday, 07.05, in room A305, B6 Building A.

We have assigned the following timeslots:

  • Collective Instance Matching: Friday, 04.05, 10:30 - 10:55
  • Holistic Schema Matching: Friday, 04.05, 10:55 - 11:20
  • Active Learning for Entity Resolution - Blocking: Friday, 04.05, 11:20 - 11:45
  • Active Learning for Entity Resolution - Query Strategies: Friday, 04.05, 11:45 - 12:10
  • Data Search for Table Extension: Friday, 04.05, 12:10 - 12:35
  • Dataspace Profiling: Monday, 07.05, 10:15 - 10:40
  • Wrapper Induction for Knowledge Base Completion: Monday, 07.05, 10:40 - 11:05
  • Set Completion using Semi-Structured Web Data: Monday, 07.05, 11:05 - 11:30
  • Truth Discovery for Knowledge Base Completion: Monday, 07.05, 11:30 - 11:55