Web Data Integration (HWS2017)

Data integration is one of the key challenges within most IT projects. Within the enterprise context, data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications. Within the context of the Web, data integration techniques form the foundation for taking advantage of the ever-growing number of publicly accessible data sources and for enabling applications such as product comparison portals, location-based mashups, and data search engines.

In the course, students will learn techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course will cover the following topics:

  1. Heterogeneity and Distributedness
  2. The Data Integration Process
  3. Structured Data on the Web
  4. Data Exchange Formats
  5. Schema Mapping and Data Translation
  6. Identity Resolution
  7. Data Quality Assessment
  8. Data Fusion

The course consists of a lecture together with accompanying practical exercises. In the exercises, the participants will gain expertise in applying state-of-the-art data integration techniques, using a real-world Web data integration project as a case study. Students will work on their projects in teams and will present their results in a written report as well as an oral presentation.

Time and Location

  • Wednesday, 15:30-17:00. Building: B6, Room: A 104 (Starting: 6.9.2017)
  • Thursday, 10:15-11:45. Building: B6, Room: A 104 (Starting: 7.9.2017)

Instructors

Final mark

  • 50 % written exam
  • 50 % project work

Slides  

  1. Slideset: Introduction and Course Organization
  2. Slideset: Types of Structured Data on the Web
  3. Slideset: Data Exchange Formats - Part 1
  4. Slideset: Data Exchange Formats - Part 2
  5. Exercise: Data Exchange Formats
  6. Slideset: Schema Mapping and Data Translation
  7. Slideset: Introduction to the Student Projects

Lecture Videos

  • Video recordings of the Web Data Integration lectures from HWS2015 are available here.

Outline

Week | Wednesday | Thursday
6.9.2017 | Lecture: Introduction to Web Data Integration | Lecture: Structured Data on the Web
13.9.2017 | Lecture: Data Exchange Formats | Lecture: Data Exchange Formats
20.9.2017 | Lecture: Schema Mapping | Lecture: Schema Mapping
27.9.2017 | Exercise: Introduction to Student Projects | Exercise: Introduction to MapForce
4.10.2017 | Feedback about Project Outlines | Project Work: Data Translation
11.10.2017 | Project Work: Data Translation | Project Work: Data Translation
18.10.2017 | Lecture: Identity Resolution | Lecture: Identity Resolution
25.10.2017 | Exercise: Identity Resolution | Project Work: Identity Resolution
1.11.2017 | Project Work: Identity Resolution | Project Work: Identity Resolution
8.11.2017 | Lecture: Data Quality and Data Fusion | Lecture: Data Quality and Data Fusion
15.11.2017 | Exercise: Data Fusion | Project Work: Data Fusion
22.11.2017 | Project Work: Data Fusion | Project Work: Data Fusion
29.11.2017 | Project Work: Data Fusion | Project Work: Data Fusion
6.12.2017 | Presentation of project results | Presentation of project results

Registration and Participation

  • The course is open to students of the Mannheim Master in Data Science and Master Business Informatics.
  • The course is restricted to 40 participants.
  • Registration opens on Wednesday, August 30, 2017, at 10:15 am.
  • Registration is done via ILIAS using this link (once the registration is open).
  • Places are allocated on a first-come, first-served basis (limit: 40 students).

Requirements

  • Programming skills in Java are required for the exercises.

Course Evaluation 

Tools

Literature 

  1. AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
  2. Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007. (Free PDF Version)
  3. Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015.
  4. Serge Abiteboul et al.: Web Data Management. Cambridge University Press, 2012.
  5. Jérôme Euzenat, Pavel Shvaiko: Ontology Matching. Springer, 2007.
  6. Felix Naumann, Melanie Herschel: An Introduction to Duplicate Detection. Morgan & Claypool, 2010.
  7. Peter Christen: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.