Web Data Integration (HWS2014)

Data integration is one of the key challenges in most IT projects. In the enterprise context, data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications. In the context of the Web, data integration techniques form the foundation for taking advantage of the ever-growing number of publicly accessible data sources and for enabling applications such as product comparison portals, location-based mashups, and entity search engines. In this course, students will learn techniques for integrating and cleansing data from large sets of heterogeneous data sources. The course will cover the following topics:

  • Heterogeneity and Distributedness
  • The Data Integration Process
  • Structured Data on the Web
  • Web Data Formats
  • Schema Mapping and Data Translation
  • Identity Resolution
  • Data Quality Assessment
  • Data Fusion

The course consists of a lecture with accompanying practical exercises. In the exercises, participants will gain experience in applying state-of-the-art data integration techniques through the case study of a real-world Web data integration project. Students will work on their projects in teams and will report on their results in a written report as well as an oral presentation.

Time and Location

  • Thursday, 13:45 to 15:15. Building: B6, 23-25 Bauteil A, Room: A 101 (Starting: 4.9.2014)
  • Friday, 12:00 to 13:30. Building: B6, 23-25 Bauteil A, Room: A 101 (Starting: 5.9.2014)

Registration

Instructor

Final mark

  • 50 % written exam
  • 50 % project work

Slides and Projects 

  1. Slides: Introduction and Course Outline 
  2. Slides: Types of Structured Data on the Web
  3. Slides: Web Data Formats - Part 1
  4. Slides: Web Data Formats - Part 2
  5. Slides: Schema Mapping
  6. Slides: Introduction to Student Projects, MapForce
  7. Slides: Identity Resolution
  8. Slides: Exercise: Identity Resolution, Eclipse project
  9. Slides: Data Quality Assessment and Data Fusion
  10. Slides: Exercise: Data Fusion, Eclipse project

 

Course Evaluation 

  • Results of the evaluation of the course by the participants from HWS2013.

Participation 

  • The course is open to students of the Master Business Informatics 
  • The course is restricted to 20 participants
  • Students can register by joining the ILIAS group

Requirements

  • Basic programming skills in Java are required for the exercises

Important dates

  • Submit your final report by Monday, December 1, 23:59
  • Project presentations on Thursday, December 4
  • Final exam on Monday, December 15, 9:00, room A5 B243

 Outline

Week | Thursday | Friday
4.9.2014 | Lecture: Introduction to Web Data Integration | Lecture: Structured Data on the Web
11.9.2014 | Lecture: Web Data Formats | Lecture: Web Data Formats
18.9.2014 | Lecture: Schema Mapping | Lecture: Schema Mapping
25.9.2014 | Exercise: Introduction to Student Projects | Lecture: Introduction to MapForce
2.10.2014 | Feedback about Project Outlines | Public holiday
9.10.2014 | Exercise: Data Translation | Exercise: Data Translation
16.10.2014 | Lecture: Identity Resolution | Lecture: Identity Resolution
23.10.2014 | Exercise: Identity Resolution | Exercise: Identity Resolution
30.10.2014 | Exercise: Identity Resolution | Exercise: Identity Resolution
6.11.2014 | Lecture: Data Quality and Data Fusion | Lecture: Data Quality and Data Fusion
13.11.2014 | Lecture: Data Quality and Data Fusion | Exercise: Data Fusion
20.11.2014 | Exercise: Data Fusion (Room changed: B6 A302!) | Exercise: Data Fusion
27.11.2014 | Exercise: Data Fusion | Exercise: Data Fusion
4.12.2014 | Presentation of project results | ---

 Literature 

  1. AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
  2. Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007.
  3. Serge Abiteboul, et al.: Web Data Management. Cambridge University Press, 2012.
  4. Jérôme Euzenat, Pavel Shvaiko: Ontology Matching. Springer, 2007.
  5. Felix Naumann: An Introduction to Duplicate Detection. Morgan & Claypool, 2012.
  6. Peter Christen: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.