Master Thesis: Design and Implementation of a Data Integration Extension for RapidMiner (Bizer/Lehmberg)

Data integration problems arise whenever data from separate sources needs to be combined as the basis for new applications. Within the context of the Web, data integration techniques form the foundation for taking advantage of the ever-growing number of publicly accessible data sources and for enabling applications such as product comparison portals, location-based mashups, and data search engines.

Many data integration solutions, however, require a high level of technical understanding from their users. There is currently no tool that lets a user integrate datasets in an ad-hoc way without deep knowledge of the underlying process.

The same situation once existed in the area of data mining, where the tool RapidMiner provided a solution with a graphical user interface and easy-to-use operators. Your task is to develop an extension for RapidMiner that contains operators for data integration tasks such as Identity Resolution, Schema Matching, and Data Fusion. These operators should allow any user who is comfortable with RapidMiner to perform a full data integration process. The implementation of the algorithms is provided by the framework used in the IE 670 Web Data Integration lecture.
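To make the Identity Resolution task mentioned above more concrete, the following is a minimal sketch of the kind of logic such an operator might wrap. It is not the API of the lecture framework or of RapidMiner: the class name `IdentityResolutionSketch`, the methods `tokens`, `jaccard`, and `match`, the sample product labels, and the threshold of 0.7 are all hypothetical and chosen only for illustration. The sketch compares every record label of one dataset with every label of another using Jaccard similarity over word tokens and reports pairs at or above the threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the lecture framework's or RapidMiner's API.
public class IdentityResolutionSketch {

    // Split a record label into a set of lower-case word tokens.
    static Set<String> tokens(String label) {
        Set<String> result =
                new HashSet<>(Arrays.asList(label.toLowerCase().trim().split("\\s+")));
        result.remove(""); // an empty label yields one empty token; drop it
        return result;
    }

    // Jaccard similarity: |intersection| / |union| of the two token sets.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty labels are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    // Compare all pairs (no blocking) and return index pairs {leftIndex, rightIndex}
    // whose similarity is at or above the threshold.
    static List<int[]> match(List<String> left, List<String> right, double threshold) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            for (int j = 0; j < right.size(); j++) {
                if (jaccard(left.get(i), right.get(j)) >= threshold) {
                    matches.add(new int[] {i, j});
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample labels from two hypothetical product datasets.
        List<String> left = Arrays.asList("Apple iPhone 6 16GB", "Samsung Galaxy S6");
        List<String> right = Arrays.asList(
                "iphone 6 apple", "Galaxy S6 Edge Samsung", "Nokia Lumia 930");
        for (int[] m : match(left, right, 0.7)) {
            System.out.println(left.get(m[0]) + " <-> " + right.get(m[1]));
        }
    }
}
```

Comparing all pairs is quadratic in the number of records, so a real operator would likely add a blocking step and let the user configure the similarity measure and threshold through operator parameters.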

The aim of this thesis is twofold:

  1. Develop a RapidMiner extension for data integration, based on an existing Java framework that implements the algorithms.

  2. Evaluate the extension on multiple use cases with respect to the quality of the data integration result and the usability of the extension.

Your skills:

For more information please contact Christian Bizer or Oliver Lehmberg.


[1] AnHai Doan, Alon Halevy, Zachary Ives: Principles of Data Integration. Morgan Kaufmann, 2012.
[2] Ulf Leser, Felix Naumann: Informationsintegration. Dpunkt Verlag, 2007.
[3] Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015.
[4] Serge Abiteboul, et al: Web Data Management. Cambridge University Press, 2012.
[5] Jérôme Euzenat, Pavel Shvaiko: Ontology Matching. Springer, 2007.
[6] Felix Naumann: An Introduction to Duplicate Detection. Morgan & Claypool, 2012.
[7] Peter Christen: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012.
[8] S. Kirstein, S. Land, D. Halfkann: RapidMiner 7: How to Extend RapidMiner.
[9] RapidMiner